
Building Multimodal

Generative AI and

Agentic Applications

Shaping concept to code for the future of multimodal

and advanced agentic GenAI applications


Indrajit Kar

www.bpbonline.com



First Edition 2026


Copyright © BPB Publications, India


ISBN: 978-93-65898-385


All Rights Reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored, and executed in a computer system but may not be reproduced by means of publication, photocopy, recording, or any electronic or mechanical means.


LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY


The information contained in this book is true and correct to the best of the author’s and publisher’s knowledge. The author has made every effort to ensure the accuracy of this publication, but the publisher cannot be held responsible for any loss or damage arising from any information in this book.


All trademarks referred to in the book are acknowledged as properties of their respective owners but BPB Publications cannot guarantee the accuracy of this information.


www.bpbonline.com



Dedicated to


My parents, my wife, my kids, and the mother


About the Author


Indrajit Kar is a distinguished AI thought leader, innovator, and author of five AI/ML books, with over 22 years of experience driving transformative AI-led products and platforms across industries. Throughout his career, he has led numerous high-impact teams responsible for developing end-to-end solutions in AI, ML, GenAI, and data science, guiding projects from conceptualization and design to deployment and scaling.


In his current role as head of AI, Indrajit spearheads large-scale initiatives that deliver measurable business impact across a diverse portfolio of global clients. His work is rooted in deep technical expertise across GenAI, large language model (LLM) architectures, MLOps, natural language processing, and computer vision. He has played a key role in integrating LLMs and autonomous AI agents into real-world applications spanning sectors such as e-commerce, healthcare, life sciences, telecommunications, and manufacturing.


Indrajit is also a strategic advisor and collaborator to C-level executives, helping enterprises unlock business value through advanced AI product and platform transformations. His leadership consistently bridges the gap between cutting-edge research and enterprise-scale implementation, accelerating AI adoption across organizations.


A recognized voice in the AI community, Indrajit has authored two books, including one dedicated to GenAI and its industry applications. He has also contributed extensively to AI research, with 27+ published papers, 21 patents filed, and multiple accolades, including eight Best Paper Awards from reputed conferences and institutions. His work often explores the intersections of innovation, scalability, and responsible AI.


With a legacy of leading R&D programs and having managed AI services and productization efforts for Fortune 500 companies, Indrajit continues to shape the future of intelligent systems. His passion for innovation, combined with a vision for ethical and scalable AI, drives his mission to empower businesses and communities through transformative technology.


About the Reviewers


  • Dhanveer Singh is a technology leader at Capital One USA with over 19 years of experience in software engineering, cloud architecture, and large-scale system modernization across financial services, insurance, and retail. He specializes in AWS, microservices, containerization, DevOps, big data, and AI/ML, delivering secure, high-performing platforms that process billions of transactions and serve millions worldwide.

    An advocate of cloud-native architectures and automation, Dhanveer has led transformative initiatives in cloud cost optimization, resilience engineering, and cybersecurity automation, driving measurable efficiency and advancing enterprise digital transformation. He has also filed multiple patents in areas of data integration, transformation, data security, and cloud automation, underscoring his focus on innovation.

    Beyond his technical leadership, he contributes as a reviewer and TPC member for international journals and conferences, serves as a judge for global IT and cybersecurity awards, and mentors through STEM and CodeDay programs.

    Dhanveer is a Fellow of IETE and IAENG, and an active IEEE and ACM member.


  • Harvendra Singh is a distinguished technology leader specializing in cloud engineering, architecture, automation, and AI-powered solutions. He designs and implements scalable, secure systems utilizing Azure, .NET, C#, Python, GCP, Kubernetes, Databricks, and other cutting-edge technologies. With expertise in cloud-native applications, microservices, event-driven architectures, and distributed systems, Harvendra drives innovation in cloud and AI ecosystems, delivering high-impact solutions that drive business value and sustainable growth.


  • Manish Jain is the vice president and head of AI architecture at Firstsource Solutions, where he leads enterprise-wide AI transformation for Fortune 100 organizations. With more than 20 years of technology leadership, including over a decade driving advanced AI innovation, he has earned recognition as an architect of transformative solutions that deliver quantifiable business impact. In addition to his corporate responsibilities, he acts as a technical consultant for Deeplearning.ai and mentors at Analytics Vidhya. He also serves as a manuscript reviewer for prominent AI publishers such as Manning and Packt, positioning him at the crossroads of research and practical enterprise applications. Manish’s unique blend of deep technical expertise and proven executive leadership enables him to guide organizations through the strategic and operational aspects of AI transformation. His commitment to advancing the AI community is evident in his advisory and mentoring roles, as well as his involvement in peer-reviewed publishing.

    These experiences make him a compelling authority on the imperatives of AI transformation and the practical challenges of scaling AI across complex enterprise environments, consistently linking innovation with measurable outcomes.


Acknowledgement


I extend my deepest appreciation to my family, parents, wife, in-laws, and children, whose steadfast encouragement and belief in me have been the cornerstone of this journey. Heartfelt thanks to BPB Publications for their patience and trust, allowing the book’s multi-part publication to thoroughly cover the dynamic field of AI. I am also grateful to my companies for fostering growth and providing opportunities to develop GenAI and agentic applications, which informed the insights shared here. To everyone who supported me, seen and unseen, your guidance and encouragement have profoundly shaped this journey, for which I am eternally thankful.


Preface


We are living in the age of intelligent collaboration, where AI is no longer just a tool, but a partner capable of retrieving knowledge, generating ideas, reasoning through problems, and interacting across modalities like text, images, and voice. The emergence of multimodal and agentic applications marks a turning point in how we build, deploy, and rely on AI.


This book, Building Multimodal Generative AI and Agentic Applications, is a practical guide for those who want to move beyond theory and actually build the future of AI systems. Across 18 chapters, you will move step-by-step from fundamentals to advanced implementations, starting with retrieval, generation, and orchestration; progressing into multimodal workflows that combine text, images, and voice; and then advancing toward real-world applications like text-to-SQL systems, OCR, fraud detection, and AI operations.


Every chapter is designed to be hands-on and approachable. You will find conceptual explanations, system design principles, code walkthroughs, and to do exercises that push you to experiment and learn by doing.


The goal of this book is not only to explain how these systems work, but also to empower you to build your own scalable, multimodal, and agentic AI applications, applications that are reliable, safe, and impactful.


Whether you are an engineer, researcher, or leader in technology, I hope that this book equips you with the knowledge, confidence, and inspiration to shape the next generation of AI.


Chapter 1: Introducing New Age Generative AI - This chapter introduces the key building blocks of modern AI systems. It begins with an overview of generative AI and then explores retrieval systems, generation systems, and the strengths of each. It covers how retrieval-augmented generation (RAG) combines the two, and how orchestration helps different AI components work together. The chapter also explains tokens, vector databases, and reranking methods, along with the differences between bi-encoders and cross-encoders. Finally, it discusses essential topics like guardrails for safe AI use, the role of agents, and the importance of Model Context Protocols.
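The retrieve-then-generate loop at the heart of RAG can be sketched in a few lines. The keyword scorer and two-document corpus below are illustrative stand-ins, not code from the book; a real pipeline would use embedding-based retrieval and an LLM for the generation step.

```python
# Minimal RAG sketch: score documents against the query, keep the best
# match, and build an augmented prompt for a (here omitted) generator.

def score(query: str, doc: str) -> int:
    """Toy sparse retrieval: count query words that appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k highest-scoring documents."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Augment the user query with retrieved context before generation."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

corpus = [
    "Vector databases store embeddings for similarity search.",
    "Guardrails constrain model outputs for safe AI use.",
]
query = "What do vector databases store?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

In a full system, `prompt` would be sent to a generation model, which is exactly the hand-off Chapter 1 formalizes.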


Chapter 2: Deep Dive into Multimodal Systems - This chapter focuses on vision-language models and their role in multimodal AI. It explains what vision-language models are, compares different implementation approaches, and explores how they differ from broader multimodal GenAI systems. The chapter also looks at vision-language models in more depth and introduces ways to classify multimodal systems based on their outputs.


Chapter 3: Implementing Unimodal Local GenAI System - This chapter explores the practical side of building GenAI systems. It begins with the role of GPUs in today’s AI landscape and how to make use of a local GPU. The chapter then introduces Ollama, including how to generate a PDF document with it. Moving forward, it explains how RAG works, along with the key challenges involved in implementing RAG effectively.


Chapter 4: Implementing Unimodal API-based GenAI Systems - This chapter provides a hands-on introduction to working with OpenAI’s APIs and models. It explains how to move from using OpenAI for basic tasks to building more advanced agentic AI solutions. You will learn how to perform multi-document queries, implement a modular retrieval-augmented generation system using OpenAI and Faiss, and explore a set of to do steps for extending these capabilities further.


Chapter 5: Implementing Agentic GenAI Systems with Human-in-the-loop - This chapter focuses on designing and advancing agentic generative AI systems. It starts with principles of architecting such systems and then walks through an end-to-end human-in-the-loop (HITL) RAG workflow. From there, it explores how HITL setups can evolve into multi-agent HITL RAG systems. The chapter concludes by clarifying the differences between agentic AI and AI agents, highlighting their distinct roles and applications.


Chapter 6: Two and Multi-stage GenAI Systems - This chapter provides a deep understanding of the concepts of interactions within dense retrieval systems and their importance in RAG. It explains the role of interaction models in two-stage RAG systems and compares different reranking strategies, including late interaction, full interaction, and multi-vector models. The chapter then introduces two-stage and multi-stage RAG architectures, discusses grading mechanisms for evaluating retrieved results, and demonstrates how to implement a multi-stage RAG workflow with routing for more accurate and efficient responses.
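The two-stage idea can be pictured as a cheap first-stage retriever that shortlists candidates, followed by a more expensive reranker that rescores only that shortlist. Both scorers below are hypothetical stand-ins for the bi-encoder and cross-encoder models the chapter discusses.

```python
# Two-stage retrieval sketch: stage 1 uses cheap word overlap to
# shortlist documents; stage 2 reranks the shortlist with a costlier
# scorer that also rewards word order (a crude cross-encoder stand-in).

def stage1(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Shortlist the k documents with the most query-word overlap."""
    overlap = lambda d: len(set(query.split()) & set(d.split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Rescore candidates by shared bigrams, rewarding word-order agreement."""
    def bigrams(s: str) -> set:
        w = s.split()
        return set(zip(w, w[1:]))
    return sorted(candidates,
                  key=lambda d: len(bigrams(query) & bigrams(d)),
                  reverse=True)

docs = [
    "late interaction models score token pairs",
    "full interaction cross encoders read query and document jointly",
    "multi vector models keep one embedding per token",
]
shortlist = stage1("full interaction cross encoders", docs)
best = rerank("full interaction cross encoders", shortlist)[0]
print(best)
```

The design point is that the expensive scorer only ever sees the shortlist, which is what makes two-stage pipelines tractable at scale.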


Chapter 7: Building a Bidirectional Multimodal Retrieval System - This chapter introduces multimodal systems and how they can be classified based on their outputs. It then explains the working of a multimodal retrieval system and provides a code implementation with step-by-step explanation. The chapter closes with a to do section, giving readers practical exercises to apply and deepen their understanding.


Chapter 8: Building a Multimodal RAG System - This chapter focuses on practical approaches to generation and evaluation using LLMs. It begins with the implementation of generation techniques, followed by an introduction to the concept of LLM-as-a-judge and its application in building recommender systems. The chapter also covers how to incorporate grading mechanisms with OpenAI to improve evaluation. It concludes with a to do section, giving readers exercises to apply these ideas in practice.


Chapter 9: Building GenAI Systems with Reranking - This chapter explores the concept of reranking and its critical role in improving retrieval and RAG systems. It explains how reranking is applied in both text-based and multimodal contexts, with a focus on using cross-encoders in multimodal RAG. The chapter also introduces the cross-encoder architecture in multimodal settings and the idea of multi-index embedding within RAG systems. Alongside these concepts, it provides a code implementation with detailed explanation and concludes with a to do section to help readers practice and solidify their understanding.


Chapter 10: Retrieval Optimization for Multimodal GenAI - This chapter examines how to make retrieval systems more efficient and effective. It begins by outlining common drawbacks of retrieval systems, then introduces various optimization techniques to address these limitations. The chapter also explores retrieval optimization in detail, showing how these methods can be applied to improve performance. It then shifts focus to multimodal RAG systems, explaining how adaptive index refresh can enhance their accuracy and responsiveness. Finally, it provides a to do section with exercises for readers to apply these ideas in practice.


Chapter 11: Building Multimodal GenAI Systems with Voice as Input - This chapter explores how RAG extends beyond just image and text. It introduces the core concepts of expanding RAG to other modalities and shows how speech interfaces can be integrated into the RAG architecture. The chapter also provides a step-by-step code implementation of a voice-enabled RAG system, demonstrating how to bring these ideas into practice.


Chapter 12: Advanced Multimodal GenAI Systems - This chapter highlights the importance of reasoning in GenAI systems. It explains the different types of reasoning used in GenAI and why they matter for building more reliable and intelligent models. The chapter also introduces key benchmarks that are used to evaluate reasoning capabilities in AI systems.


Chapter 13: Advanced Multimodal GenAI Systems Implementation - This chapter focuses on how reasoning can be enhanced in GenAI through effective prompting techniques. It then explores specialized architectures that bring reasoning into play at different stages—first during reranking, where results are refined, and then at the recommendation stage, where reasoning helps deliver more accurate and context-aware suggestions.


Chapter 14: Building Text-to-SQL Systems - This chapter delves into the complexities of text-to-SQL and why it is considered a challenging problem. It begins by explaining the basic concepts and then explores real-world applications where text-to-SQL can make a significant impact. The chapter discusses the key challenges involved, followed by practical guidance on designing an effective text-to-SQL system. It also covers entity extraction using large language models, highlighting how this integrates with text-to-SQL to improve performance. Finally, the chapter emphasizes how such systems can enhance data accessibility and literacy, while also introducing performance metrics and best practices to ensure reliability.
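The extract-then-generate flow can be outlined as follows. A real system would use an LLM for entity extraction and schema-aware SQL generation; the regex extractor, the column set, and the `orders` table here are hypothetical stand-ins.

```python
# Text-to-SQL sketch: extract entities from a natural-language question,
# then slot them into a SQL template. The schema and extraction rules
# are toy examples, not the book's implementation.
import re

SCHEMA_COLUMNS = {"region", "revenue", "year"}  # hypothetical schema

def extract_entities(question: str) -> dict:
    """Pick out known column names and a four-digit year, if present."""
    cols = [c for c in SCHEMA_COLUMNS if c in question.lower()]
    year = re.search(r"\b(19|20)\d{2}\b", question)
    return {"columns": sorted(cols), "year": year.group(0) if year else None}

def to_sql(question: str, table: str = "orders") -> str:
    """Assemble a SELECT statement from the extracted entities."""
    e = extract_entities(question)
    sql = f"SELECT {', '.join(e['columns'])} FROM {table}"
    if e["year"]:
        sql += f" WHERE year = {e['year']}"
    return sql

print(to_sql("Show revenue by region for 2024"))
```

Even this toy version shows why the problem is hard: ambiguous phrasing, synonyms, and joins across tables all fall outside what simple extraction can handle, which is where LLM-based approaches come in.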


Chapter 15: Agentic Text-to-SQL Systems and Architecture Decision-Making - This chapter presents the design and implementation of an agentic text-to-SQL system tailored for real-time retail intelligence. It explains the system’s architecture in detail, along with code walkthroughs for better understanding. A step-by-step pipeline is provided to show how the system processes queries, leading to meaningful outputs. The chapter concludes by demonstrating the actual results generated by the text-to-SQL system and how they address the original problem statement.


Chapter 16: GenAI for Extracting Text from Images - This chapter introduces three different approaches to applying GenAI for optical character recognition. It explains how OCR works on images, as well as how it can be extended to multimodal documents that combine text, images, and other elements. The chapter concludes with a to do section, giving readers practical exercises to apply and reinforce what they have learned.


Chapter 17: Integrating Traditional AI/ML into GenAI Workflow - This chapter explores how traditional machine learning models can be integrated into GenAI workflows through a detailed case study. It presents a practical use case of hybrid ensemble learning for telecom fraud detection, showing how models like XGBoost can be wrapped and enhanced within an LLM-powered system. The chapter also provides a comparative overview of different ways ML models can be combined with GenAI to create hybrid solutions. It concludes with a to do section, offering readers hands-on activities to deepen their understanding.
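The wrapping idea can be pictured as a classical scorer whose verdict is handed to an LLM step. The weighted-feature scorer below stands in for an XGBoost model, and `llm_review` is a hypothetical hook marking where a real pipeline would call a language model.

```python
# Hybrid-workflow sketch: a classical model scores a transaction, and
# its verdict plus features are packed into a prompt-like summary for
# an LLM to explain or escalate. Features and weights are illustrative.

def classical_score(txn: dict) -> float:
    """Stand-in for an XGBoost fraud probability: simple weighted features."""
    return 0.7 * (txn["amount"] > 1000) + 0.3 * txn["foreign"]

def llm_review(txn: dict, score: float) -> str:
    """Where a real pipeline would send this summary to an LLM."""
    verdict = "escalate" if score >= 0.5 else "approve"
    return f"Transaction {txn['id']}: fraud score {score:.2f} -> {verdict}"

txn = {"id": "T1", "amount": 2500, "foreign": True}
print(llm_review(txn, classical_score(txn)))
```

The pattern keeps the well-calibrated classical model in the decision path while letting the GenAI layer add explanation, context, and human-readable output.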


Chapter 18: LLM Operations and GenAI Evaluation Techniques - This chapter highlights the importance of operations in building and running production-grade GenAI applications. It compares evaluation methods for LLMs and RAG systems, introduces the concept of RagOps, and emphasizes the need for continuous monitoring and observability platforms. The chapter also explores how graph-enhanced RAG can improve recommendation systems and provides a comparison of different Ops practices in modern software development. Finally, it offers practical guidance on setting up MLflow for managing experiments and deployments.


Code Bundle and Coloured Images


Please follow the link to download the


Code Bundle and the Coloured Images of the book:

https://rebrand.ly/78f896



The code bundle for the book is also hosted on GitHub at https://github.com/bpbpublications/Building-Multimodal-Generative-AI-and-Agentic-Applications. In case there’s an update to the code, it will be updated on the existing GitHub repository.


We have code bundles from our rich catalogue of books and videos available at https://github.com/bpbpublications. Check them out!


Errata


We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and provide our subscribers with an immersive reading experience. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing process. To help us maintain quality and reach out to any readers who might be having difficulties due to unforeseen errors, please write to us at:

errata@bpbonline.com



Your support, suggestions, and feedback are highly appreciated by the BPB Publications’ Family.


Join our Discord space


Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com




Table of Contents

  1. 1. Introducing New Age Generative AI

    1. Introduction

    2. Structure

    3. Objectives

    4. Overview of generative AI

    5. Retrieval system

      1. Sparse retrieval

        1. Dense retrieval

    6. Generation system

      1. Types of generation systems

      2. Autoregressive generation

      3. Prompting strategies

    7. Understanding where generation systems excel

      1. Combining retrieval and generation

    8. Retrieval-augmented generation

      1. RAG working

      2. Architecture of a basic RAG pipeline

      3. Types of RAG architectures

      4. Iterative RAG

      5. Vector databases and RAG

      6. Prompt engineering for RAG

      7. Advanced RAG techniques

      8. Applications of RAG

    9. Orchestration in AI systems

      1. Orchestration in RAG systems

      2. Orchestration in agentic systems

    10. Tokens in AI systems

    11. Vector database

      1. Understanding vector databases

      2. Indexing algorithms in vector databases

      3. Search algorithms in vector databases

      4. Embeddings and embedding models

      5. Importance of vector databases for RAG and agentic systems

    12. Reranking

    13. Bi-encoders vs. cross-encoders

      1. Cross-encoders for reranking

    14. Guardrails

      1. Types of guardrails

      2. Methods of applying guardrails

      3. Without guardrails

      4. Industry examples of guardrail solutions

    15. Agents

      1. Agentic RAG vs. non-agentic RAG

    16. Model Context Protocols

    17. Conclusion

  2. 2. Deep Dive into Multimodal Systems

    1. Introduction

    2. Structure

    3. Objectives

    4. Understanding vision-language models

      1. Categories of vision-language models

      2. Core architectural components of vision-language models

      3. Challenges in vision-language models

      4. Multimodal GenAI system

    5. Multimodal vector embedding

      1. Multimodal vector database

      2. Collections

      3. Points and point IDs

      4. Vectors

      5. Payload

      6. Storage and vector store

      7. Indexing

    6. Implementation comparisons

      1. Single collection, partitioned via payload

      2. Multiple collections with global indexing

    7. Multimodal generative AI systems vs. VLMs

    8. Vision-language models

      1. Multimodal generative AI systems

      2. Using vision-language models

      3. Using multimodal generative AI systems

      4. Real-world example comparison

    9. Output-based classification of multimodal systems

      1. Text-to-image systems

      2. Image-to-text systems

      3. Text and image systems

      4. Text-only to specifications and image systems

      5. Text-to-SQL systems

      6. Text-to-code systems

    10. Conclusion

  3. 3. Implementing Unimodal Local GenAI System

    1. Introduction

    2. Structure

    3. Objectives

    4. GPU in today’s generative AI systems

    5. Using a local GPU

      1. Architectural components

    6. About Ollama

      1. Alternatives to Ollama

    7. Generate a PDF document with Ollama

    8. RAG implementation

      1. Load and chunk the PDF document

      2. Alternative chunking strategies in LangChain

      3. Creating embeddings with metadata

        1. Using them in code

      4. Hybrid search with semantic and keyword

        1. Other retrievers you can use

      5. Conversation memory buffer

      6. LLM configuration natural language generation

      7. ReAct prompt template

      8. Building the conversational QA chain

      9. User chat loop

    9. Challenges in RAG

    10. Conclusion

  4. 4. Implementing Unimodal API-based GenAI Systems

    1. Introduction

    2. Structure

    3. Objectives

    4. Getting started with OpenAI APIs and models

      1. OpenAI as a company

      2. Overview of the OpenAI API

    5. Core API endpoints

      1. Major OpenAI models

      2. Accessing OpenAI models

      3. Choosing the right model

      4. Best practices for beginners

      5. From OpenAI to agentic AI

      6. OpenAI’s agentic API ecosystem

        1. Responses API

        2. Agents SDK

        3. Operator

        4. Codex

        5. Assistants API

    6. Multi-document query

    7. Implementing modular RAG with OpenAI

      1. Main controller

      2. Configuration

      3. Embedding initialization

      4. Vector store setup

      5. Metadata tagging

      6. Document loading and chunking

      7. Hybrid retriever

        1. Enforce metadata-based filtering during retrieval

      8. Language model

      9. Prompt template

      10. RAG chain assembly

      11. Conversational memory

      12. Dependencies

    8. To do

    9. Conclusion

  10. 5. Implementing Agentic GenAI Systems with Human-in-the-loop

    1. Introduction

    2. Structure

    3. Objectives

    4. Architecting agentic GenAI systems

      1. Parallel pattern

      2. Sequential pattern

      3. Loop pattern

      4. Router pattern

      5. Aggregator pattern

      6. Network pattern

      7. Hierarchical pattern

      8. Human-in-the-loop pattern

      9. Shared tools pattern

      10. Database with tools pattern

      11. Memory transformation using tools

      12. Planner-executor pattern

      13. Critic or validator pattern

      14. Negotiator pattern

      15. Multimodal agent pattern

      16. Voting or consensus pattern

      17. Supervisor-subordinate pattern

      18. Watchdog or recovery pattern

      19. Temporal planner pattern

        1. Human-in-the-loop

    5. End-to-end human-in-the-loop RAG workflow

    6. From HITL to multi-agent human-in-the-loop RAG

    7. Agentic AI vs. AI agents

    8. Conclusion

  12. 6. Two and Multi-stage GenAI Systems

    1. Introduction

    2. Structure

    3. Objectives

    4. Concepts of interactions in dense retrievals

      1. No interaction

      2. Full interaction

      3. Late interaction

      4. Multi-vector representations

      5. Differentiation from late interaction architectures

    5. Role of interaction models in two-stage RAG systems

      1. Interaction in the retrieval phase

    6. Reranking with various interaction models

      1. Integration into two-stage RAG architectures

    7. Two-stage RAG architecture

      1. Stage one dense retrievals

      2. Stage two reranking for semantic precision

      3. The strategic role of two-stage design

      4. Two-stage RAG vs. late interaction

        1. Capabilities of ColBERT and ColPali

        2. Use of two-stage RAG

    8. Multi-stage RAG

      1. Beyond two-stage systems

      2. Components of multi-stage RAG

      3. Benefits of multi-stage RAG

        1. Types of multi-stage RAG

    9. Grading mechanisms

      1. Challenges and considerations

      2. Token utilization in multi-stage RAG systems

      3. Grading types

    10. Implementation of multi-stage RAG workflow with routing

    11. Conclusion

  14. 7. Building a Bidirectional Multimodal Retrieval System

    1. Introduction

    2. Structure

    3. Objectives

    4. Output-based classification of multimodal systems

      1. Integration and design implications

    5. Understanding a multimodal retrieval system

      1. Technical architecture

        1. Applications and implications

    6. Code implementation and explanation

      1. Requirement

        1. Frontend

        2. Data directory

        3. The retrieval system

        4. Loaders

      2. Embedding utils

        1. Index builder

        2. Process to run the entire code

    7. To do for the readers

    8. Conclusion

  16. 8. Building a Multimodal RAG System

    1. Introduction

    2. Structure

    3. Objectives

    4. Implementation of generation

      1. Architectural components and workflow

        1. Generator

    5. Multimodal LLM-based recommender system

      1. Leading architectures and examples

    6. Incorporate grading with OpenAI

      1. Import statements

      2. Generative responsive grader

      3. Retrieval relevance grader

      4. Grading and generation models

    7. Cloud LLMs for grading

      1. LLM-as-a-judge

      2. Rationale and functionality

    8. To do

    9. Conclusion

  18. 9. Building GenAI Systems with Reranking

    1. Introduction

    2. Structure

    3. Objectives

    4. Reranking

    5. Reranking in information retrieval and RAG systems

      1. Reranking in RAG pipelines

    6. Reranking using cross-encoder in multimodal RAG

    7. Cross-encoder architecture in multimodal settings

      1. Cross-encoders vs. late interaction rerankers

      2. Applications in multimodal retrieval

      3. Commercial reranker

      4. Recap of cross-encoder

        1. Cross-encoders and their role in embedding

    8. Multi-index embedding in RAG systems

    9. Code implementation and explanation

    10. To do

      1. Setup instructions

    11. Conclusion

  20. 10. Retrieval Optimization for Multimodal GenAI

    1. Introduction

    2. Structure

    3. Objectives

    4. Retrieval optimization techniques

    5. Drawbacks of retrieval systems

    6. Retrieval optimization techniques mitigating the limitations

      1. Multi-index embedding

      2. Modality-based routing for multimodal queries

      3. Query expansion

      4. Embedding normalization

      5. Hybrid retrieval

        1. Score normalization

      6. Reranking with cross-encoders

        1. Prefiltering thresholds

      7. Adaptive index refresh

      8. Retrieval optimization techniques

        1. GA implementation for optimizing modality

        2. Explanation

      9. Multimodal RAG system with adaptive index refresh

      10. One-time or scheduled index refresh script

    7. Enhancing multimodal RAG with adaptive refresh

      1. Vector embedding pipeline and storage in Qdrant

      2. Two-stage retrieval and multi-vector reranking

      3. Context assembly and language generation

      4. Adaptive embedding refresh mechanism

      5. Indexing behavior

    8. To do

    9. Conclusion

  22. 11. Building Multimodal GenAI Systems with Voice as Input

    1. Introduction

    2. Structure

    3. Objectives

    4. RAG beyond image and text RAG

    5. Concepts

    6. Integrating speech interfaces into RAG architecture

    7. Code implementation of the voice-enabled RAG system

      1. Tech stack overview

      2. Frontend

      3. Main voice-enabled pipeline

    8. Conclusion

  24. 12. Advanced Multimodal GenAI Systems

    1. Introduction

    2. Structure

    3. Objectives

    4. The critical role of reasoning in generative AI systems

      1. From generation to deliberation

      2. Trust and explainability in AI systems

      3. Handling ambiguity and disambiguation

      4. Multimodal integration requires logical composition

      5. Prompt engineering and CoT reasoning

      6. Reranking and meta-reasoning

      7. Learning generalizable strategies

      8. Human-AI collaboration

      9. Foundation for agentic AI

    5. Reasoning in GenAI and their types

      1. Deductive reasoning in AI

      2. Inductive reasoning in AI

      3. Abductive reasoning in AI

      4. Analogical reasoning in AI

      5. Commonsense reasoning

      6. Causal reasoning

      7. Spatial reasoning

      8. Temporal reasoning

      9. Mathematical reasoning

      10. Tool-based reasoning and ReAct agents

      11. Multimodal reasoning and fusion in AI systems

    6. About reasoning benchmark

    7. Conclusion

  26. 13. Advanced Multimodal GenAI Systems Implementation

    1. Introduction

    2. Structure

    3. Objectives

    4. Prompting techniques for reasoning in GenAI systems

      1. Basic prompting techniques

        1. Zero-shot prompting

        2. Few-shot prompting

      2. Advanced prompting strategies for reasoning in GenAI systems

    5. Architecture for reasoning at the reranking stage

      1. Module: loaders.py

      2. Module: embedding_utils.py

      3. Module: index_builder.py

      4. Module: reranker.py

      5. Module: langgraph_agent.py

        1. Agentic characteristics of the langgraph_agent.py module

        2. Agentic attributes and functionality

    6. Architecture for reasoning at the recommendation stage

      1. The dataset

      2. Goal of the recommendation engine

        1. Final retrieval constraints

      3. Modular codebase breakdown

    7. Conclusion

  28. 14. Building Text-to-SQL Systems

    1. Introduction

    2. Structure

    3. Objectives

    4. Text-to-SQL: a hard problem

    5. Understanding basic concepts

    6. Exploration of real-world applications

    7. Key challenges

    8. Practical guidance on designing a text-to-SQL system

    9. Entity extraction using LLM and text-to-SQL system

      1. Architecture overview

    10. Enhance data accessibility and literacy

    11. Performance metrics and best practices

      1. Exact match accuracy

      2. Execution accuracy

      3. Component-level accuracy

      4. Query execution success rate

      5. Semantic equivalence and canonicalization

      6. Human evaluation

      7. Latency and throughput metrics

      8. Best practices for performance evaluation

    12. Conclusion

  30. 15. Agentic Text-to-SQL Systems and Architecture Decision-Making

    1. Introduction

    2. Structure

    3. Objectives

    4. Agentic text-to-SQL system for real-time retail intelligence

      1. Business challenge and problem statement

    5. Architecture and code explanation of text-to-SQL system

    6. Step-by-step pipeline explanation

      1. Folder structure

        1. Requirements

        2. Setup instructions

      2. Understanding each Python script

        1. Main execution layer

        2. Agent modules

        3. Core infrastructure layer

        4. Task-oriented modules

        5. Frontend interface

        6. System setup and index initialization

      3. Inner workings of the code

        1. Agent and tool summary

    7. Output from the text-to-SQL system

      1. Detailed entity and database summary

        1. Generated SQL query

        2. SQL query grade

        3. Summary grade

    8. Solution to the initial problem statement

    9. Conclusion

  32. 16. GenAI for Extracting Text from Images

    1. Introduction

    2. Structure

    3. Objectives

    4. Three approaches to GenAI-based OCR

      1. Shopping assistance use case

    5. OCR on image

      1. Building shopping assistance

      2. Architecture overview

      3. Understanding the output

    6. OCR on a multimodal document

      1. Mistral's OCR

      2. The regex in context

      3. OCR in receipt data

    7. To do

    8. Conclusion

  34. 17. Integrating Traditional AI/ML into GenAI Workflow

    1. Introduction

    2. Structure

    3. Objectives

    4. Case study

    5. Integrating the traditional model with GenAI

      1. Initialization of these hybrid systems

    6. Use case

      1. Data characteristics and preprocessing

      2. Baseline model development and evaluation

      3. Stacked ensemble learning approach

      4. Purpose of the LLM in this setup

    7. Wrapping XGBoost model into LLM

      1. Run order

      2. Code implementation

      3. Model training pipeline

      4. FastAPI serving layer

      5. Tool wrapper for FastAPI inference

      6. LangChain tool registration

      7. Agent orchestration with Mistral via Ollama

    8. Comparative overview of ML model integration in GenAI workflows

    9. To do

    10. Conclusion

  36. 18. LLM Operations and GenAI Evaluation Techniques

    1. Introduction

    2. Structure

    3. Objectives

    4. Importance of Ops in production-grade GenAI applications

    5. Comparing LLM and RAG evaluations

      1. LLM evaluation

      2. RAG evaluation

        1. Importance of distinction

        2. Evaluation as the core of GenAI Ops

        3. Ensuring output quality at scale

      3. Monitoring drift and hallucinations

      4. Evaluating retrieval quality for preemptive debugging

      5. Supporting version control and traceability

      6. Feedback loops and self-healing systems

    6. RAGOps

      1. During development

        1. Identification in RAGOps during development

        2. Benchmarking in RAGOps during development

      2. Post-development

        1. Identify post-development

        2. Benchmarking in RAG systems post-development

    7. Continuous monitoring

      1. Continuous monitoring in live RAG systems

      2. Key metrics to monitor in RAGOps

      3. Techniques and tools for continuous monitoring

      4. Alerting, dashboards, and anomaly detection

      5. Feedback loop and self-healing systems

    8. Observability platforms

      1. Core observability platforms

      2. RAG-specific evaluation libraries

      3. Auxiliary tools and ecosystem integrations

    9. Graph-enhanced RAG-based recommendation system

      1. Data ingestion pipeline

      2. Retrieval and recommendation pipeline

        1. Agentic RAG design and multi-tool retrieval in the system

        2. Agentic control loop

        3. Three complementary retrieval tools

        4. Operational role of the agent

        5. Operational risk analysis and monitoring metrics

    10. Comparison of various Ops in modern software development

    11. Installation of MLflow

      1. Observability pipeline

      2. Approach 1

      3. Approach 2

      4. Troubleshooting MLflow using local filesystem structure

    12. Conclusion

  38. Index

CHAPTER 1 Introducing New Age Generative AI

Introduction

This chapter sets the stage for mastering new age generative AI (GenAI) systems by introducing essential concepts and foundational technologies. We begin by exploring the difference between retrieval systems and generation systems, followed by an in-depth look at vector databases, search algorithms, embedding techniques, indexing, and reranking, all critical for building intelligent, efficient AI solutions. Key reliability mechanisms, such as reflection and guardrails, are discussed to ensure outputs remain robust and aligned with user intent.

We then dive into advanced prompting methods like chain of thought (CoT) to guide AI models through structured reasoning processes. Moving into agentic AI, the chapter covers agents, tools, reasoning, planning, and action execution, expanding into the design of multi-agent systems capable of complex, collaborative tasks. A comparative overview of large language models (LLMs), large vision models (LVMs), and emerging large action models (LAMs) is provided, along with practical insights into local model deployment and graphics processing unit (GPU) infrastructure planning.

Further, we introduce speech technologies, including automated speech recognition (ASR) and generation, and explain the critical role of memory management in agent-based architectures. Finally, we present industry standards like Model Context Protocol (MCP) and differentiate the evolving responsibilities of a GenAI developer vs. a GenAI engineer, preparing readers for advanced system design.

Structure

This chapter covers the following topics:

  • Overview of generative AI
  • Retrieval system
  • Generation systems
  • Understanding where generation systems excel
  • Retrieval-augmented generation
  • Orchestration in AI systems
  • Tokens in AI systems
  • Vector database
  • Reranking
  • Bi-encoders vs. cross-encoders
  • Guardrails
  • Agents
  • Model Context Protocols

Objectives

This chapter aims to equip readers with a comprehensive understanding of the key building blocks essential for designing and deploying modern GenAI systems. By exploring concepts such as retrieval and generation systems, vector databases, embedding techniques, advanced prompting strategies, agentic architectures, and multi-agent collaboration, readers will gain a strong foundation for building intelligent, scalable AI solutions. Additionally, the chapter introduces critical topics like local model deployment, GPU infrastructure, speech processing, memory management in agents, and industry standards like MCPs. These foundational elements are crucial for advancing toward multimodal, reliable, and production-ready AI applications.

Overview of generative AI

The evolution of generative models represents one of the most significant paradigm shifts in AI. In the era before generative pre-trained transformers (GPTs), GenAI was shaped by powerful techniques such as Boltzmann machines, variational autoencoders (VAEs), generative adversarial networks (GANs), and autoencoders. These models achieved groundbreaking results by generating unstructured data like images, audio, and even text. For instance, GANs revolutionized realistic image synthesis, while VAEs enabled probabilistic generative modeling of complex data spaces, including speech and document generation.

While impressive, these earlier systems generally focused on single-domain generation with limited ability to reason, plan, or generalize across tasks. They lacked the rich contextual understanding, dynamic reasoning, and task-driven flexibility that define modern AI experiences.

The true paradigm shift occurred not directly with GPT models, but with the introduction of the transformer architecture itself in 2017 (in the seminal paper Attention Is All You Need by Vaswani et al.). The transformer introduced the concepts of self-attention, parallel processing, and positional encoding, enabling models to scale massively in both size and capability, far beyond the limits of generative models based on traditional recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or convolutional neural networks (CNNs).
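
The self-attention idea at the heart of this shift can be sketched in a few lines of pure Python. This is a minimal illustration only: the identity projections stand in for the learned W_Q, W_K, and W_V matrices of a real transformer, and the token embeddings are made-up toy values.

```python
import math

def softmax(xs):
    m = max(xs)                                # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of vectors X.

    For clarity, queries, keys, and values are the inputs themselves
    (identity projections); a real transformer learns separate
    W_Q, W_K, W_V projection matrices.
    """
    d = len(X[0])
    out = []
    for q in X:                                # every position attends to all others
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)              # attention distribution over positions
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Three toy token embeddings of dimension 2. Each position is processed
# independently of the others (no recurrence), which is what allows
# transformers to parallelize across the whole sequence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)
print(Y)
```

Each output row is a convex combination of the input vectors, weighted by how strongly that position attends to every other position.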

Building on the transformer foundation, GPTs ushered in the era of open-ended generation models capable of not just recreating data but performing tasks like conversation, reasoning, summarization, code generation, and multimodal synthesis. The modern GenAI systems now exhibit semantic awareness, dynamic problem-solving, and multimodal understanding across text, images, and speech.

Several key advancements define this new age, which are as follows:

  • Massive pre-training on diverse, heterogeneous datasets.
  • Scaling laws showing predictable improvements with more parameters, data, and compute.
  • CoT prompting techniques for guided reasoning.
  • Agentic AI architectures where models not only generate but also reason, plan, and act.
  • Multi-agent systems collaborating toward complex goals.
  • Multimodal generation across text, vision, and audio modalities.
  • Private and local deployments driven by improvements in GPU infrastructure and efficient models.

Note: The scope of this book is focused exclusively on new-age GenAI systems. If you seek to explore the foundations of older generative models, including Boltzmann machines, autoencoders, VAEs, and GANs, you can refer to another book authored by me and my co-author, titled "Learn Python Generative AI: Journey from Autoencoders to Transformers to Large Language Models" (published by BPB Publications). It provides a detailed walkthrough of the classical generative modelling journey leading to today's cutting-edge systems.

In this book, we move beyond classical generation, focusing on designing, building, and deploying reasoning, planning, and action-oriented GenAI—the systems that are now transforming industries, enterprises, and everyday experiences. Understanding this transition is key: what started as data mimicry has evolved into intelligent, multimodal agents capable of augmenting and automating human thought itself.

While generative models have evolved to create rich, human-like outputs, not all AI solutions rely solely on generation. In fact, many of the most powerful AI systems today combine retrieval with generation to ground their outputs in real-world information, improve reliability, and reduce hallucinations.

Before exploring generation strategies, it is essential to first understand retrieval systems, the backbone of how AI finds, filters, and brings relevant knowledge into the conversation. Retrieval forms a critical pillar of modern AI infrastructure, supporting tasks ranging from search engines and recommendation systems to advanced retrieval-augmented generation (RAG) pipelines.

In the next section, we will explore what retrieval systems are, how they differ from pure generative models, and why they are indispensable for building accurate, scalable, and production-grade AI applications.

Retrieval system

GenAI systems today are celebrated for their creativity and reasoning abilities, but behind many of these intelligent behaviors lies a strong foundation built on retrieval mechanisms. Retrieval is often the hidden engine that allows AI to ground its outputs in real-world knowledge, find relevant facts, and maintain coherence across conversations or tasks. To truly appreciate how retrieval has become such a critical pillar of modern AI, it is important to first understand how it evolved, from simple keyword matching to sophisticated, learning-driven, and memory-augmented techniques.

Prior to understanding modern retrieval systems, it is helpful to trace their evolution briefly, which is discussed in the following table:

Year        | Milestone                                                                          | Description
1970s–2000s | Term frequency–inverse document frequency (TF-IDF), Best Matching 25 (BM25)       | Early keyword-based retrieval methods focused on matching exact terms.
2020        | Dense passage retrieval (DPR)                                                      | Introduced dense embeddings to semantically match questions and documents.
2021        | Hybrid retrieval                                                                   | Combined sparse (BM25) and dense (DPR) methods to improve robustness.
2020–2022   | RAG                                                                                | Tight integration of retrieval with generation models to enhance grounding.
2023+       | In-context learning retrieval, memory-augmented retrieval                          | Dynamic, reasoning-driven retrieval embedded inside LLM workflows.

Table 1.1: Historic timelines of retrieval systems

With the preceding background, given in Table 1.1, in mind, it becomes clear that retrieval is no longer a simple lookup process; it has evolved into a dynamic, intelligent layer that actively augments the reasoning capabilities of AI systems. In the following sections, we will explore how retrieval systems work, the key components that make them powerful, and how they integrate seamlessly with generative models to build reliable, context-aware AI applications.

The foundation of modern retrieval systems can be traced back to early innovations like DPR, introduced by Facebook AI Research (now Meta AI) around 2020. DPR was a major breakthrough compared to traditional sparse retrieval methods (such as TF-IDF and BM25) because it introduced dense vector representations for both queries and documents. This allowed semantic retrieval, finding information based on meaning rather than relying purely on keyword overlap.

Dense retrieval marked a major turning point: models could now encode the meaning of a query and a document into a shared embedding space where similarity could be computed efficiently. Instead of matching exact words, dense retrieval matched concepts and ideas. However, early dense retrievers still had limitations: they sometimes retrieved irrelevant passages due to coarse semantic matching, and scaling them to millions or billions of documents required solving difficult engineering challenges around efficiency and latency.

Sparse retrieval

Sparse retrieval methods like TF-IDF and BM25 rely on matching exact keywords and term frequency statistics. While older, they remain highly effective in cases where precision is critical and queries are closely tied to specific terminology, such as in legal document search, scientific literature, and enterprise document retrieval, where exact matches matter more than general semantic similarity. Sparse retrieval also scales very efficiently with traditional inverted index techniques and remains a strong baseline in many real-world search systems.
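The scoring idea behind sparse retrieval can be sketched in a few lines of plain Python. The following is a toy TF-IDF ranker over a made-up three-document corpus; the corpus, whitespace tokenizer, and weighting are illustrative simplifications, not BM25's exact formula:

```python
import math
from collections import Counter

corpus = [
    "the court reviewed the patent claim",
    "the patent describes a neural network",
    "stock markets closed higher today",
]

def tokenize(text):
    return text.lower().split()

docs = [tokenize(d) for d in corpus]
N = len(docs)

def idf(term):
    # Inverse document frequency: rare terms contribute more to the score.
    df = sum(1 for d in docs if term in d)
    return math.log((N + 1) / (df + 1)) + 1

def tfidf_score(query, doc):
    # Sum of term-frequency * IDF over query terms present in the document.
    tf = Counter(doc)
    return sum(tf[t] * idf(t) for t in tokenize(query))

query = "patent claim"
ranked = sorted(range(N), key=lambda i: tfidf_score(query, docs[i]), reverse=True)
print(ranked[0])  # the first document mentions both "patent" and "claim"
```

Note how the third document scores zero: sparse retrieval finds nothing unless the exact terms overlap, which is precisely the limitation dense retrieval addresses.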

Dense retrieval

Dense retrieval methods, introduced with models like DPR and Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE), marked a major shift from sparse term-matching techniques (e.g., BM25) toward semantic vector-based retrieval. Dense retrievers excel when dealing with open-domain search, ambiguous queries, or when synonyms and paraphrases are common, for example, in customer support bots, multilingual retrieval, or semantic frequently asked questions (FAQs) matching. Dense retrieval allows systems to understand the intent behind a question, even when the exact words differ between the query and the document. The following figure shows the basic flow of semantic retrieval using a vector database:

Flow diagram showing a user submitting a query, which is processed by an embedding model and matched against a vector database of document chunks, returning relevant vector search results to the user.

Figure 1.1: Basic flow of semantic retrieval using a vector database

Note: To maintain clarity and simplicity, this figure illustrates document chunking and embedding as part of the overall RAG process. In practice, these steps (chunking and embedding of documents) are performed offline during the indexing phase and not during real-time query execution. This simplification applies across all figures and workflows presented in the chapters of this book.

The following figure illustrates the offline phase of a RAG pipeline, where raw documents are first processed using language chunking tools (e.g., Llama-based parsers or LangChain utilities) to divide them into manageable segments. These chunks are then passed through an embedding model, such as OpenAI’s embedding API, to generate dense vector representations. The resulting embeddings are stored in a vector database, forming the searchable index that powers downstream retrieval during real-time query execution. This preprocessing step is critical to enabling fast, scalable, and semantically rich document retrieval in multimodal or LLM-based applications.

Flow diagram showing how documents are processed by chunking libraries and then passed into an embedding model, whose output is stored in a database labeled "vector database with vector embeddings".

Figure 1.2: Offline document indexing and embedding workflow
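The offline indexing phase can be sketched as follows. Note that `embed` here is a stand-in hashing trick producing a small normalized vector, not a real embedding model; in practice this step would call a model such as OpenAI's embedding API, and the "vector database" would be Faiss, Qdrant, or similar rather than a plain list:

```python
import hashlib

def chunk(text, size=40, overlap=10):
    # Fixed-size character chunking with overlap; real pipelines often
    # chunk by sentences or tokens instead.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk_text, dim=8):
    # Stand-in for a real embedding model: hash character trigrams into
    # a small fixed-size vector. Only the interface matters here.
    vec = [0.0] * dim
    for i in range(len(chunk_text) - 2):
        h = int(hashlib.md5(chunk_text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

# The "vector database": a list of (chunk, embedding) pairs built offline.
document = "Dense retrieval encodes queries and documents into a shared embedding space."
index = [(c, embed(c)) for c in chunk(document)]
print(len(index))
```

At query time, the same `embed` function is applied to the user query and compared against the stored vectors by cosine similarity.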

Reflecting on the evolution, today’s retrieval systems have dramatically advanced beyond the early DPR architecture:

  • Hybrid retrieval: Modern systems increasingly combine sparse and dense retrieval (e.g., BM25 + dense embeddings) to balance recall and precision, especially valuable in long-tail queries or domain-specific knowledge bases.
  • Multi-vector representations: Advanced methods like ColBERT (late interaction models) encode multiple vectors per document rather than a single one, improving retrieval accuracy without sacrificing too much speed.
  • Retriever-generator fusion (RAG systems): Retrieval is no longer a standalone step; it is now tightly integrated into the generation pipeline. Models like RAG retrieve documents dynamically during inference and condition the generated output, improving factual accuracy and reducing hallucinations.
  • Memory-augmented retrieval: Agentic AI systems use episodic memory, blending external document retrieval with internally learned knowledge to continuously adapt and improve over time.
  • Learning-to-retrieve (LTR) and in-context retrieval: Some newer architectures like Retro and RePlug move beyond static indexes, enabling the model itself to learn retrieval strategies during inference, deciding what to retrieve based on the reasoning context dynamically.

Additionally, vector database technology has matured rapidly. Tools like Facebook AI Similarity Search (Faiss), Milvus, Qdrant, Azure AI Search, and Pinecone offer scalable, high-speed vector search, supporting billions of embeddings with approximate nearest neighbor (ANN) algorithms, metadata filtering, and hybrid retrieval capabilities—all critical for powering modern enterprise-grade RAG systems.

It is crucial to recognize that retrieval today is no longer just about fetching documents. It has become an intelligent augmentation mechanism, involving filtering, reranking, reasoning, and dynamic knowledge grounding. Retrieval is evolving from a backend lookup service into a frontline reasoning component of next-generation AI.

Thus, understanding retrieval deeply, not simply as a search technique but as an intelligent augmentation strategy, is essential for building reliable, scalable, and goal-driven new-age GenAI applications.

Retrieval systems are typically evaluated based on metrics like recall@k, precision@k, and Mean Reciprocal Rank (MRR), which measure how effectively the system retrieves relevant documents among the top results. We will cover retrieval evaluation in greater detail later, but for now, it is important to remember that retrieval quality is judged by both accuracy and ranking efficiency.
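These metrics are straightforward to implement. The following is a minimal sketch, assuming `retrieved` is a ranked list of document IDs and `relevant` is the set of ground-truth relevant IDs; the example IDs are made up for illustration:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents that appear in the top-k results.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are relevant.
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(queries):
    # Mean Reciprocal Rank: average of 1/rank of the first relevant hit.
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 3))  # only d1 is in the top-3 -> 0.5
print(mrr([(retrieved, relevant)]))         # first relevant hit at rank 2 -> 0.5
```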

Generation system

As we have seen, retrieval systems focus on finding the most relevant existing information. However, many real-world tasks demand more than just retrieval—they require creation, reasoning, and original synthesis. This is where generation systems come into play.

In this section, we will explore what generation systems are, how they operate, and the core techniques that power them. We will discuss different types of generation tasks, such as text, image, and audio creation, and understand key mechanisms like autoregressive modeling, diffusion models, and sampling strategies. Additionally, we will cover important concepts like temperature control, prompt design, and the balance between creativity and factuality.

We will also examine the typical challenges faced by generation systems, such as hallucination, coherence issues, and safety risks, and highlight where these systems truly excel, especially in tasks that demand open-ended creativity or complex problem-solving. Finally, we will briefly introduce how retrieval and generation are increasingly being combined in modern AI architectures to build more grounded and intelligent systems.

Let us begin by understanding the fundamental nature of generation systems and how they differ from purely retrieval-based approaches.

Generation systems are AI models designed to produce new content, rather than simply retrieve it. They can generate text, images, audio, code, and even multimodal outputs by learning complex patterns from training data. Unlike retrieval, which surfaces information that already exists, generation enables models to compose new sentences, invent new images, and solve new problems dynamically at inference time.

Modern generation systems are typically large-scale neural networks or LLMs trained with billions of parameters on massive datasets across multiple domains. The following figure shows the types of LLMs and generation models:

Diagram comparing a large language model (LLM) answering text queries, a large action model taking a query to produce a result or perform an action, and a large vision-and-language model processing an image query to obtain a result.

Figure 1.3: Types of LLMs and generation models

Types of generation systems

GenAI systems span multiple modalities, each designed to create content such as text, images, or audio based on user input, showcasing the versatility and power of modern machine learning (ML) models. Let us look at the types of generation systems:

  • Text generation: Models like GPT, Llama, and Claude specialize in generating coherent paragraphs, answering questions, summarizing articles, translating languages, or even writing poetry and code. They are autoregressive, meaning they predict the next token based on previous tokens—enabling them to build long, meaningful sequences word by word.
  • Image generation: Models like DALL·E, Stable Diffusion, and Imagen generate images from text prompts (text-to-image generation). These systems rely on techniques like diffusion models or GANs to iteratively create realistic images from random noise, conditioned on user instructions.
  • Audio generation: In audio generation, models like Whisper (for automatic speech recognition, ASR) and VALL-E (for speech synthesis) produce human-like speech or even create music. These models learn representations of sound waves and either recognize speech (ASR) or generate audio based on text inputs.

Core techniques behind the generation are as follows:

  • Language models: Language models are trained to predict the next word (token) given a previous sequence, and so they are called autoregressive models, as explained in Figure 1.3. Large models like GPT-3/4/o3, Llama, or Claude learn contextual relationships and world knowledge through self-supervised learning, enabling diverse generation tasks such as answering questions, summarizing documents, and creative writing.
  • Vision models: Models like DALL·E and Stable Diffusion apply transformer-like architectures to image patches or latent representations, allowing text-to-image generation. They capture the structure, style, and content of visual elements in latent spaces.
  • Diffusion models: Diffusion models start with random noise and iteratively denoise it to create a realistic sample. Popular for generating high-fidelity images (e.g., Stable Diffusion, Imagen), they have also been adapted for audio and even 3D model generation. Diffusion models are being actively adapted for language tasks, though they are still less mature and less dominant than transformer-based models (like GPT). The field of language diffusion models is rapidly evolving, and several research efforts have shown that diffusion-based generative models can be competitive with or complementary to autoregressive language models.

Autoregressive generation

In autoregressive models (like GPT), each output token is generated one at a time, conditioned on previously generated tokens. This sequential token-by-token generation allows models to produce highly coherent outputs, but can also lead to error accumulation if not managed carefully. The following figure explains how LLM generates in an autoregressive manner (one token at a time):

Diagram showing how text input is converted into tokens, then into token IDs, and fed into the language model; the model outputs a vector representation for each token, with "Today", "is", and "a" highlighted.

Figure 1.4: LLM generation in an autoregressive manner (one token at a time)
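The token-by-token loop can be illustrated with a toy bigram model, where the next-token distribution is a hard-coded table rather than a learned network; real LLMs compute these probabilities over vocabularies of tens of thousands of tokens:

```python
# A toy bigram "language model": each token's successor distribution is a
# hand-written table, standing in for a trained neural network.
bigram = {
    "<s>":   {"Today": 0.9, "The": 0.1},
    "Today": {"is": 1.0},
    "is":    {"a": 0.8, "the": 0.2},
    "a":     {"good": 0.7, "bad": 0.3},
    "good":  {"day": 1.0},
    "day":   {"</s>": 1.0},
}

def generate(max_tokens=10):
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = bigram.get(tokens[-1])
        if dist is None:
            break
        # Greedy decoding: pick the most probable next token, conditioned
        # only on what has been generated so far.
        next_token = max(dist, key=dist.get)
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]

print(" ".join(generate()))  # "Today is a good day"
```

Because each step conditions only on previously emitted tokens, a single early mistake propagates through the rest of the sequence, which is the error-accumulation risk mentioned above.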

The following are the temperature and sampling strategies:

  • Temperature: Controls the randomness of the generation. Lower temperatures give more deterministic and factual outputs; higher temperatures give more creative and diverse outputs.
  • Top-k sampling: Limits the next token choice to the top-k most probable tokens.
  • Top-p (nucleus) sampling: Selects from the smallest set of tokens whose cumulative probability exceeds top-p.

Tuning these parameters allows fine control over creativity vs. precision in AI generation.
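These three controls can be sketched together in plain Python. The logits below are made up for illustration; real models produce raw scores over their full vocabulary, and production samplers are implemented in tensor libraries rather than dictionaries:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    # logits: {token: raw score}. Temperature rescales scores before softmax.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    probs = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(probs.values())
    probs = {t: p / z for t, p in probs.items()}
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:          # keep only the k most probable tokens
        items = items[:top_k]
    if top_p is not None:          # keep smallest set with cumulative mass >= top_p
        kept, mass = [], 0.0
        for t, p in items:
            kept.append((t, p))
            mass += p
            if mass >= top_p:
                break
        items = kept
    # Draw from the (renormalized) surviving distribution.
    rng = random.Random(seed)
    r = rng.uniform(0, sum(p for _, p in items))
    for t, p in items:
        r -= p
        if r <= 0:
            return t
    return items[-1][0]

logits = {"cat": 2.0, "dog": 1.0, "car": 0.1}
print(sample(logits, temperature=0.1, seed=0))  # near-greedy -> "cat"
```

Lowering the temperature sharpens the distribution (nearly always "cat" here), while raising it flattens the scores and lets "dog" or "car" appear more often.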

Prompting strategies

Prompts are critical for steering the behavior of generation systems. Advanced prompting techniques like CoT enable multi-step reasoning by encouraging models to explain their thought process before answering. We will explain these in more detail in the next section.

Understanding where generation systems excel

Generation systems are particularly powerful in the following:

  • Open-ended creativity tasks (storytelling, image creation, poetry, coding).
  • Complex reasoning and problem-solving beyond retrieval capabilities.
  • Personalization and dynamic response generation (chatbots, educational tutors).
  • Bridging gaps where no pre-existing data exactly fits the query.

Combining retrieval and generation

While generation systems are incredibly powerful at creating new content, they sometimes struggle with factual accuracy, up-to-date knowledge, and grounding their outputs in real-world information. To overcome these challenges, modern AI architectures increasingly combine the strengths of retrieval and generation, giving rise to a powerful paradigm known as RAG.

In the next section, we will explore how RAG systems work, why they are critical for building reliable AI applications, and how they seamlessly integrate retrieval and generation into a unified, intelligent workflow.

Retrieval-augmented generation

RAG is an advanced AI architecture that combines retrieval and generation into a unified workflow. Instead of relying solely on a model's internal knowledge (which may be outdated or incomplete), a RAG system first retrieves relevant external information and then generates an answer conditioned on that retrieved content.

RAG emerged to address key challenges faced by pure generation models, which are as follows:

  • Hallucination: It sometimes generates fabricated, plausible-sounding but incorrect outputs.
  • Stale knowledge: Pre-trained models have a static knowledge base (cutoff dates).
  • Groundedness: Users often demand outputs linked to verifiable, real-world information.

RAG bridges these gaps, making outputs more accurate, grounded, and up-to-date.

How RAG works

A RAG system typically involves two major steps, which are as follows:

  1. Retrieval step: Given a user query, the system first retrieves the top-k most relevant documents or chunks from an external knowledge base (e.g., a vector database).
  2. Generation step: The retrieved documents are passed as context to a language model (LLM), which generates the final answer conditioned on the retrieved information.

Thus, the model does not generate from memory alone; it reads first, then reasons.

Architecture of a basic RAG pipeline

The following list outlines what a basic RAG pipeline looks like:

  • Query understanding: The input query is processed, optionally rephrased or expanded, to optimize retrieval.
  • Retrieval: A dense or hybrid retriever fetches the most relevant documents from a vector database or search engine.
  • Context preparation: Retrieved documents are selected, truncated, chunked, and formatted to fit within the LLM’s input context window.
  • Generation: The LLM is prompted with both the original query and the retrieved documents to generate a grounded, contextually rich response.
  • Output delivery: The model's final response is returned to the user.
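The steps above can be sketched end to end in a minimal single-pass pipeline. The knowledge base, the word-overlap retriever, and the `call_llm` stub are all illustrative placeholders; a real system would use an embedding-based retriever and send the prompt to an actual LLM endpoint:

```python
# Minimal single-stage RAG sketch with toy components.
KNOWLEDGE_BASE = [
    "The warranty period for the X200 laptop is 24 months.",
    "The X200 battery can be replaced by authorized service centers.",
    "Our office is closed on public holidays.",
]

def retrieve(query, k=2):
    # Toy retriever: rank documents by word overlap with the query.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(KNOWLEDGE_BASE, key=overlap, reverse=True)[:k]

def build_prompt(query, docs):
    # Context preparation + prompt construction.
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

def call_llm(prompt):
    # Placeholder: a real system would send `prompt` to an LLM API here.
    return f"[LLM response grounded in {prompt.count('- ')} retrieved chunks]"

query = "What is the warranty period for the X200?"
answer = call_llm(build_prompt(query, retrieve(query)))
print(answer)
```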

Types of RAG architectures

There are many different types of RAG architectures evolving today, depending on how retrieval and generation are orchestrated. However, to keep the scope focused, the following are the two most common and practical ones:

  • Single-stage RAG:
    • A simple pipeline: retrieve → generate.
    • Used when retrieval quality is high and latency needs to be minimal.

    The following figure shows a single-stage RAG architecture:

Flow diagram showing a user querying the system; the query and document chunks are embedded into a vector database and processed by the LLM to generate a result that is returned to the user.

Figure 1.5: Single-stage RAG architecture

  • Two-stage RAG:
    • Retrieval → reranking → generation.
    • After initial retrieval, a second model (e.g., cross-encoder) reranks documents to improve the quality before passing them to the generator.
    • Reduces hallucination by focusing the generation only on the most relevant documents.
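The two-stage idea can be sketched as follows. The `cross_encoder_score` function here simulates a learned reranker with simple heuristics purely for illustration; a real cross-encoder is a neural model that reads the query and document jointly:

```python
# Two-stage sketch: a cheap first-pass retriever proposes candidates,
# then a (here simulated) cross-encoder rescores query-document pairs.
def first_pass(query, corpus, k=3):
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def cross_encoder_score(query, doc):
    # Simulated fine-grained relevance: reward exact phrase containment
    # on top of word overlap. A real cross-encoder learns this scoring.
    q, d = query.lower(), doc.lower()
    overlap = len(set(q.split()) & set(d.split()))
    return overlap + (5 if q.replace("?", "") in d else 0)

def rerank(query, candidates):
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)

corpus = [
    "refund policy applies to items returned within 30 days",
    "how to request a refund for a cancelled order",
    "shipping times vary by region",
]
query = "refund for a cancelled order"
top = rerank(query, first_pass(query, corpus))[0]
print(top)
```

Because the expensive scorer only sees a handful of candidates, the second stage adds precision without rescoring the whole corpus.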

The following figure shows a two-stage RAG architecture:

RAG flow diagram showing how user input is processed through an embedding model, stored in a vector database, reranked, and passed through an LLM with input and output guardrails to generate the result.

Figure 1.6: Two-stage RAG architecture

Iterative RAG

Iterative RAG has the following two characteristics:

  • Retrieval and generation happen across multiple turns.
  • The model can retrieve additional documents dynamically if the first batch is insufficient, refining the answer step-by-step.

Vector databases and RAG

Vector databases are critical infrastructure for efficient RAG systems.

  • Purpose: They store document embeddings and enable fast semantic search based on vector similarity.
  • Examples: Faiss (Meta), Qdrant, Milvus, Pinecone, Weaviate.

ANN algorithms are used for scalability, finding close enough vectors quickly rather than exact matches, enabling real-time retrieval over millions or billions of documents.

Vector stores also allow metadata filtering (e.g., date, author) and sharding for distributed retrieval, essential for scaling enterprise RAG systems.

Prompt engineering for RAG

How the retrieved content is formatted and fed into the LLM significantly affects output quality.

Key techniques include the following:

  • Chunking: Breaking large documents into smaller pieces to fit multiple passages into the prompt.
  • Windowing: Sliding a fixed-size window over documents to capture local context around keywords.
  • Context management: Selecting the most relevant chunks without exceeding the model’s token limit.

Well-constructed prompts ensure the LLM focuses on the most important information during generation.
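Chunking with overlapping windows and greedy context selection can be sketched as follows. The window sizes and relevance scores are illustrative; real systems count model tokens (not whitespace-split words) and take scores from a retriever:

```python
def sliding_window_chunks(tokens, window=6, stride=3):
    # Overlapping windows preserve local context across chunk boundaries.
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks

def fit_to_budget(chunks, scores, token_budget):
    # Greedy context management: take the highest-scoring chunks that fit
    # the model's token limit, then restore document order.
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    selected, used = [], 0
    for i in order:
        if used + len(chunks[i]) <= token_budget:
            selected.append(i)
            used += len(chunks[i])
    return [chunks[i] for i in sorted(selected)]

tokens = "retrieval augmented generation grounds model outputs in external documents".split()
chunks = sliding_window_chunks(tokens)
print(len(chunks))
```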

Advanced RAG techniques

As RAG systems evolve, advanced techniques are being developed to enhance retrieval quality, improve response accuracy, and enable more context-aware generation. The following are some of the advanced RAG techniques:

  • RAG with reranking:
    • Use a reranker (like a cross-encoder) to evaluate and reorder the retrieved documents based on fine-grained relevance scoring before generation.
    • Improves precision without significantly increasing retrieval time if optimized properly.
  • Memory-augmented RAG:
    • Retrieval is not only from static knowledge bases but also from episodic memories, storing past conversation snippets or learned experiences.
    • Enables dynamic, personalized, and context-aware responses in multi-turn dialogue systems.
  • Multimodal RAG:
    • Extend RAG to retrieve both text and images (or videos, audio).
    • Example: In a medical assistant role, retrieve x-rays and patient notes together, feeding both into a multimodal model like GPT-4V or Flamingo.

Applications of RAG

RAG systems have rapidly gained adoption across industries. Let us understand its applications:

  • Enterprise chatbots: Customer service bots grounded in company knowledge bases.
  • Document QA systems: Answering queries from large corpora like research papers, legal documents, or technical manuals.
  • Knowledge management: Organizing and dynamically accessing enterprise knowledge in real-time.
  • Personalized AI assistants: Tailoring responses based on user-specific documents, emails, notes, etc.

In every case, RAG ensures the AI system produces reliable, verifiable, and grounded outputs.

Orchestration in AI systems

As AI systems become increasingly complex, especially with the rise of RAG and agentic AI systems, the need for intelligent orchestration has become critical. Orchestration refers to how different components, such as retrieval engines, language models, memory modules, and external tools, are managed, sequenced, and coordinated dynamically to achieve a specific goal.

Unlike traditional single-call LLM applications, RAG systems and agentic systems involve multi-step reasoning and dynamic decision-making, requiring sophisticated orchestration frameworks.

Orchestration in RAG systems

In RAG systems, orchestration involves the following:

  • Query understanding: Preprocessing user queries before retrieval.
  • Document retrieval: Interfacing with vector databases (e.g., Faiss, Qdrant, Pinecone) to fetch top-k relevant documents.
  • Context preparation: Chunking, selecting, and formatting retrieved documents to fit within the LLM’s context window.
  • Prompt construction: Dynamically inserting retrieved knowledge into well-structured prompts.
  • Response generation: Using the LLM to generate outputs grounded in the provided documents.
  • Post-processing (optional): Filtering, reranking, or verifying model outputs.

Frameworks like LangChain, LlamaIndex, and Haystack specialize in orchestrating these steps automatically, making it easier to build scalable and production-ready RAG pipelines.

The following figure explains how LangChain is orchestrating the entire RAG process:

Flow diagram showing the embedding model processing the query, storing results in the vector database, retrieving and reranking documents, and sending them with the query to the LLM to generate a result that is returned to the user.

Figure 1.7: The bold lines are orchestrated by LangChain or similar orchestrators

Good RAG orchestration ensures the following:

  • Minimal latency
  • High retrieval quality
  • Tight coupling between retrieval and generation
  • Robust handling of token limits and memory

Orchestration in agentic systems

In agentic systems, orchestration becomes even more dynamic.

An agent is an AI entity capable of the following:

  • Reasoning about a task.
  • Choosing actions (e.g., tool usage, API calls, retrievals).
  • Executing actions step-by-step.
  • Reflecting and adjusting its plan dynamically based on intermediate results.

Agentic orchestration involves the following:

  • Tool selection: Deciding which external tools or functions to call based on the current goal.
  • Multi-step planning: Sequencing actions logically toward solving complex problems.
  • Memory management: Retaining past actions, intermediate results, and conversation history to guide future steps.
  • Error handling: Retrying or recovering gracefully if an action fails.
  • Goal management: Continuously checking whether the final task objective has been met.

Frameworks like LangChain Agents, LlamaIndex, Haystack, etc., provide orchestration primitives for agentic systems, allowing AI models to behave like autonomous, multi-step decision-makers.

Good agentic orchestration ensures the following:

  • Task decomposition into manageable actions.
  • Dynamic adaptation to unforeseen outcomes.
  • Human-like reasoning chains that span multiple tool interactions.

While orchestration focuses on managing the overall flow of complex AI systems, another foundational concept operates at a much lower level: how information itself is represented and processed inside models. Before any retrieval, generation, or reasoning can happen, input text must be broken down into a form that models can understand, a process known as tokenization.

To fully appreciate the capabilities and limitations of AI systems, it is essential to understand what tokens are, how tokenization works, and why it plays a critical role in shaping performance, cost, and design choices.

Let us now understand tokenization.

Tokens in AI systems

In modern AI systems, particularly LLMs, the concept of tokens is fundamental to how inputs and outputs are processed. A token is not necessarily a word; it can be a word, a part of a word (subword), or even punctuation and special characters, depending on the model’s tokenizer.

Tokenization is the process of breaking down text into discrete units that the model can understand and process. Models like GPT-3, GPT-4, and Llama do not operate directly on raw text; they operate on sequences of tokens.

There are different types of tokenization strategies, like the following:

  • Word-level tokenization: In early models, one token often corresponded to one full word (e.g., dog, running). This approach is simple but inefficient for handling rare or compound words.
  • Subword-level tokenization [Byte-Pair Encoding (BPE)]: Used by most modern LLMs (including GPT, BERT). Common word parts are merged (e.g., run + ning | running) to reduce the vocabulary size while still handling rare words efficiently.
  • Character-level tokenization: Each character (including spaces and punctuation) is treated as a token. It increases sequence length dramatically and is less common in large models due to computational cost.
  • Byte-level tokenization: Some models tokenize based on raw bytes (e.g., GPT-2 uses byte-level BPE), allowing flexible multilingual processing without complex pre-tokenization.
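
To make the subword idea concrete, the following is a minimal, illustrative sketch of the BPE merge step in pure Python. It is a toy example (tiny made-up corpus, three merge iterations), not a production tokenizer:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word as a tuple of characters, with its frequency.
words = {("r", "u", "n"): 5,
         ("r", "u", "n", "n", "i", "n", "g"): 3,
         ("r", "u", "n", "s"): 2}

for _ in range(3):  # perform three merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)
```

After a few merges, the frequent root run becomes a single symbol, while rarer continuations (ning, s) remain decomposed, which is exactly the vocabulary-compression behavior described above.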

The number of tokens determines:

  • The cost (for APIs such as OpenAI's, GPT-4 pricing is per 1,000 tokens).
  • The context window size (how much information the model can see at once).
  • Model behavior and truncation, which occurs if the input exceeds the maximum token limit (e.g., GPT-4 Turbo supports at most 128K tokens).

Thus, understanding tokens, what they represent, and how they are counted is critical for optimizing performance, controlling generation length, managing costs, and designing effective prompt engineering strategies.
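
Because pricing is per token, a quick back-of-the-envelope cost estimate can be computed directly. The per-1K-token prices below are illustrative placeholders, not real rates; always check the provider's current rate card:

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  price_in_per_1k=0.01, price_out_per_1k=0.03):
    """Estimate API cost in dollars from token counts and per-1K-token prices.

    The default prices are illustrative placeholders, not actual rates.
    """
    return (prompt_tokens / 1000) * price_in_per_1k + \
           (completion_tokens / 1000) * price_out_per_1k

# 8,000 prompt tokens + 2,000 completion tokens at the placeholder rates:
cost = estimate_cost(prompt_tokens=8000, completion_tokens=2000)
print(f"Estimated cost: ${cost:.4f}")  # 8 * 0.01 + 2 * 0.03 = $0.14
```

The same arithmetic is useful in reverse: given a budget, it tells you how many tokens of context and output you can afford per request.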

For example, the input "Today is a beautiful day outside." might be split into subwords like (To, day, is, a, be, aut, iful, day, out, side), depending on the tokenizer.

Once split into tokens, each token is then mapped to a unique token ID using a vocabulary table (pre-built during model training). Each token ID corresponds to an integer that the model understands internally. For instance:

  • To | 98
  • day | 1452
  • beautiful (split into subwords) | 2932 and 1709

Thus, the entire input sequence is transformed into a vector of token IDs—a list of numbers that the model can operate on.

At this point, the token IDs are passed through an embedding layer. This embedding layer converts each token ID into a high-dimensional vector (e.g., 768-dimensional) that captures semantic relationships between tokens. Tokens that are semantically related (e.g., dog and puppy) will have embeddings that are close in vector space.
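
The "closeness" of embeddings is usually measured with cosine similarity. A minimal pure-Python illustration follows; the tiny 4-dimensional vectors are made up for demonstration and stand in for real embeddings with hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy embeddings; real models produce e.g. 768-dimensional vectors.
dog   = [0.8, 0.3, 0.1, 0.0]
puppy = [0.7, 0.4, 0.1, 0.1]
car   = [0.0, 0.1, 0.9, 0.4]

print(cosine_similarity(dog, puppy))  # high: semantically related
print(cosine_similarity(dog, car))    # low: unrelated
```

The related pair scores close to 1.0, while the unrelated pair scores near 0, which is the geometric intuition behind semantic search.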

From there, the token embeddings move through the model’s architecture, the attention layers, transformer blocks, and eventually lead to output generation or reasoning.

In summary, tokenization bridges the gap between human language and machine understanding. It translates messy, variable-length human text into standardized numerical forms that deep learning models can efficiently process. Without tokenization, modern language models would not be able to handle the complexity and diversity of human communication. The following figure shows the tokenization process flow in modern language models:

A diagram showing text input converted into tokens and then token IDs, which the language model processes to produce vector representations.

Figure 1.8: Tokenization process flow

Vector database

While tokenization enables language models to process and understand text inputs at a granular level, handling large-scale retrieval tasks requires a different kind of representation. Instead of working directly with tokens, retrieval systems operate on dense vector embeddings—mathematical representations that capture the semantic meaning of text, images, or other data types. To store, search, and retrieve these embeddings efficiently, vector databases have become an essential component of modern AI architectures.

Let us now explore the role of vector databases and how they power scalable, high-performance retrieval systems.

Before we explore vector databases further, it is important to first understand where they fit among other types of databases.

Let us look at the types of databases, which are as follows:

  • Relational databases: Organize data into structured tables of rows and columns; best for structured data with SQL querying. (e.g., MySQL, PostgreSQL)
  • Key-value stores: Store simple key-value pairs; optimized for fast lookups. (e.g., Redis, DynamoDB).
  • Document databases: Manage semi-structured data in flexible formats like JavaScript Object Notation (JSON). (e.g., MongoDB, CouchDB).
  • Graph databases: Store relationships between entities using nodes and edges. (e.g., Neo4j, ArangoDB).
  • Wide-column databases: Organize data into column families for efficient large-scale reads/writes. (e.g., Cassandra, HBase).
  • In-memory databases: Store data in random access memory (RAM) for ultra-fast access. (e.g., Redis, Memcached).
  • Time-series databases: Specialized for sequential, time-stamped data. (e.g., InfluxDB, TimescaleDB).
  • Text search databases: Optimized for full-text indexing and search. (e.g., Elasticsearch, Solr).
  • Spatial databases: Store geographic/spatial data. (e.g., PostGIS).
  • Blob stores: Manage large binary files like images and videos. [e.g., Amazon Simple Storage Service (S3)].
  • Ledger databases: Immutable record-keeping [e.g., Hyperledger Fabric, Amazon Quantum Ledger Database (QLDB)].
  • Hierarchical databases: Tree-like parent-child structured data. [e.g., IBM Information Management System (IMS)].
  • Vector databases: Store high-dimensional embeddings for semantic search and similarity operations. (e.g., Faiss, Chroma, Pinecone).

Among these, vector databases have become essential for AI, retrieval, and agentic reasoning systems.

Understanding vector databases

Vector databases are designed to store and retrieve dense vector embeddings, numerical representations of unstructured data such as text, images, or audio. Unlike relational or document databases, vector databases perform similarity searches based on distance metrics like cosine similarity or Euclidean distance rather than exact matching.

They enable AI models to retrieve semantically similar items efficiently, a critical operation for RAG and memory-augmented agentic systems.
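
At its core, a vector database's similarity search is a nearest-neighbor lookup over stored embeddings. The following is a minimal in-memory sketch of that core operation; the store, document IDs, and 3-dimensional vectors are all made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

class TinyVectorStore:
    """A toy in-memory vector store with exact (brute-force) cosine search."""

    def __init__(self):
        self.items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self.items.append((doc_id, vector))

    def search(self, query, k=2):
        # Score every stored vector against the query, highest similarity first.
        scored = [(cosine(query, vec), doc_id) for doc_id, vec in self.items]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]

store = TinyVectorStore()
store.add("doc_dogs", [0.9, 0.1, 0.0])
store.add("doc_cats", [0.8, 0.3, 0.1])
store.add("doc_cars", [0.0, 0.2, 0.9])

print(store.search([0.85, 0.2, 0.05], k=2))  # the two pet documents rank first
```

Production systems replace this linear scan with the ANN indexes described in the next section, but the query interface (vector in, ranked IDs out) stays the same.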

Indexing algorithms in vector databases

Efficient vector search is fundamental to AI-driven applications, but performing brute-force searches over millions of high-dimensional vectors is computationally intensive. To address this challenge, vector databases use specialized indexing algorithms that improve search speed while balancing accuracy and memory efficiency.

The following are some commonly used indexing techniques:

  • Flat (brute-force search): Every query is compared against all stored vectors.
    • Offers the highest accuracy.
    • Slow and not scalable for large datasets.
  • Inverted File Indexing (IVF): Vectors are grouped into clusters, and searches are restricted to the most relevant clusters.
    • Significantly faster than brute-force.
    • Maintains good accuracy with improved efficiency.
  • Hierarchical navigable small world (HNSW): A graph-based structure that enables fast and accurate approximate nearest neighbor (ANN) searches.
    • Well-suited for real-time search.
    • Balances high recall with low latency.
  • Product quantization (PQ): Compresses vectors into compact representations for scalable search.
    • Suitable for very large datasets.
    • Efficient in both storage and retrieval with reasonable accuracy.
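
The IVF idea can be sketched in a few lines: assign each vector to a cluster up front, then at query time probe only the cluster nearest the query instead of scanning everything. The fixed toy centroids below are assumptions for illustration; a real IVF index learns them with k-means:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Fixed toy centroids; a real IVF index learns these via k-means training.
centroids = {"animals": [1.0, 0.0], "vehicles": [0.0, 1.0]}
clusters = {"animals": [], "vehicles": []}

def add(doc_id, vec):
    # Assign the vector to its nearest centroid (its "inverted file" bucket).
    best = max(centroids, key=lambda c: cosine(vec, centroids[c]))
    clusters[best].append((doc_id, vec))

def search(query, k=1):
    # Probe only the single closest cluster instead of scanning all vectors.
    probe = max(centroids, key=lambda c: cosine(query, centroids[c]))
    scored = sorted(clusters[probe],
                    key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

add("dog", [0.9, 0.1])
add("cat", [0.8, 0.2])
add("car", [0.1, 0.9])

print(search([0.95, 0.05]))  # only the "animals" bucket is scanned
```

The speed/accuracy trade-off is visible in the `probe` step: searching one cluster is fast, but a vector sitting near a cluster boundary can be missed, which is why real systems let you probe several clusters (`nprobe > 1`).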

Each of these indexing methods offers a different balance between speed, accuracy, and memory usage. The best choice depends on the specific needs of the application, including dataset size, performance requirements, and infrastructure constraints.

Search algorithms in vector databases

Vector databases are designed to store and retrieve high-dimensional vector embeddings efficiently, key to powering modern AI applications like semantic search, recommendation systems, and image similarity. Once data is encoded into vector embeddings and indexed, search algorithms are used to retrieve the nearest neighbors to a given query vector.

The two main approaches are as follows:

  • Exact search: This method compares the query vector against every vector in the database to find the most similar ones. It offers high recall and accuracy but is computationally expensive and slow, making it suitable only for small datasets or offline analysis.
  • ANN search: Instead of exhaustive comparison, approximate nearest neighbor (ANN) algorithms search a smaller subset of the database to find results that are good enough. This trade-off between speed and precision is essential for scaling to millions or billions of vectors, enabling real-time search and inference.

Common ANN techniques include the following:

  • HNSW: A graph-based method known for fast and accurate retrieval.
  • IVF + PQ: It combines clustering and compression for memory efficiency.
  • ScaNN: A high-performance ANN algorithm developed by Google, optimized for large-scale production environments.

Most production-grade vector databases (like Faiss, Milvus, or Pinecone) rely on ANN search to deliver low-latency, high-throughput performance without sacrificing too much on relevance or recall.

Embeddings and embedding models

At the core of vector databases lies the concept of embeddings. An embedding is a dense numerical vector that captures the semantic meaning of an input (text, image, audio) in such a way that similar inputs are closer together in the vector space.

For example, two sentences about dogs will have embeddings close together even if they use different wording.

Embedding models are neural networks trained to map inputs into these vector spaces. Some popular types of embedding models are as follows:

  • Text embeddings:
    • OpenAI's text-embedding-ada-002
    • Sentence-BERT (SBERT)
    • Hugging Face MiniLM
  • Image embeddings:
    • CLIP (OpenAI)
    • DINOv2 (Meta)
  • Multimodal embeddings:
    • CLIP (joint vision-language)
    • Flamingo (DeepMind)

Embedding models are crucial because the quality of retrieval depends heavily on the quality of embeddings.

Importance of vector databases for RAG and agentic systems

In RAG pipelines, embeddings of queries are matched against stored document embeddings inside a vector database to retrieve the most relevant knowledge for grounding responses.

In agentic systems, an agent might need to:

  • Retrieve past experiences
  • Fetch external knowledge
  • Search through the tool outputs

All of this happens dynamically at runtime, based on vector similarity rather than just keyword matching.

Thus, vector databases enable semantic memory and scalable, intelligent retrieval, two cornerstones of the new age of GenAI. The following figure represents the workflow of a similar search pipeline using embeddings, where input data is transformed into vectors to retrieve semantically similar results:

A flow diagram showing four steps connected by arrows: text/image input, embedding model, vector database, and similar results, representing how an embedding model and database are used to find similar items.

Figure 1.9: Basic flow of semantic retrieval using a vector database

While vector databases enable fast and efficient retrieval of semantically similar documents, the top results returned by similarity search are not always perfectly aligned with the user's true intent. Retrieval-based purely on vector similarity can sometimes surface documents that are only loosely relevant, leading to less accurate or less grounded final outputs.

To address this challenge, an important refinement step called reranking is often introduced. Reranking allows AI systems to reorder and prioritize retrieved documents based on deeper relevance scoring, improving the quality of the inputs ultimately passed to the language model for generation.

Let us now understand reranking, why it is needed, how it works, and the different approaches used in modern AI pipelines.

Reranking

The concept of reranking is not new to AI. It has deep roots in recommendation systems and search engines.

In traditional recommendation pipelines (e.g., recommending products, movies, articles), the system typically retrieves a broad set of candidates, say, the top 100 or top 1000 items, based on rough matching like user history or content similarity. However, these initial candidates are often imperfect, as retrieval systems prioritize recall, getting as many potentially good items as possible, even at the cost of precision.

Thus, a reranking step is introduced, details as follows:

  • A more sophisticated model (often a deeper neural network) re-evaluates the initial set and reorders them based on more accurate relevance scores.
  • The goal is to maximize precision at the top and ensure the first few items shown to the user are the most relevant and impactful.

This two-stage approach, retrieval and reranking, is now fundamental not just in recommendation systems but also in modern RAG pipelines and search engines.

Bi-encoders vs. cross-encoders

Bi-encoders and cross-encoders are two popular architectures used for tasks like semantic search and ranking in natural language processing (NLP).

Bi-encoders independently encode the query and document into separate vector embeddings using the same model. These embeddings can then be efficiently compared using cosine similarity or other distance metrics, making bi-encoders ideal for large-scale retrieval where speed and scalability are critical.

Cross-encoders, on the other hand, jointly encode the query and document by feeding them together into a transformer model. This allows the model to consider cross-attention between tokens, resulting in more accurate relevance scoring. However, this approach is computationally expensive and slower, limiting its use in real-time or large-scale systems.

In the context of retrieval and reranking, a common pattern is to use bi-encoders for fast candidate retrieval, followed by cross-encoders for reranking the top results to improve precision, balancing efficiency and accuracy effectively:

  • Bi-encoders:
    • Encode the query and documents separately into dense vectors.
    • Similarity is computed after encoding, usually via dot product or cosine similarity.
    • Extremely fast and scalable, and ideal for first-pass retrieval from large corpora.
    • Examples: DPR, Sentence Transformers.
  • Cross-encoders:
    • Concatenate the query and document together and encode them jointly.
    • The model sees both pieces together and can model fine-grained interactions.
    • Much slower than bi-encoders because every query-document pair needs to be recomputed.
    • However, much more accurate at judging true relevance.
    • Examples: BERT for passage ranking, MiniLM for reranking.

Cross-encoders for reranking

In RAG pipelines or AI search systems, the typical workflow is as follows:

  • Use a bi-encoder retriever to fetch top-k candidates quickly.
  • Apply a cross-encoder reranker on these k candidates (e.g., top 100) to rescore and reorder them.
  • Select the top-n reranked documents (e.g., top 5 or top 10) to feed into the generation model.
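
The retrieve-then-rerank workflow above can be sketched with stand-in scoring functions. The `fast_score` and `slow_accurate_score` functions below are cheap placeholders for a bi-encoder and a cross-encoder respectively, and the corpus is made up for illustration:

```python
def fast_score(query, doc):
    """Placeholder bi-encoder: a cheap word-overlap score for first-pass retrieval."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def slow_accurate_score(query, doc):
    """Placeholder cross-encoder: here, overlap weighted by document brevity."""
    return fast_score(query, doc) / (1 + len(doc.split()))

def retrieve_then_rerank(query, corpus, k=3, n=1):
    # Stage 1: bi-encoder-style retrieval of top-k candidates (fast, broad recall).
    candidates = sorted(corpus, key=lambda d: fast_score(query, d),
                        reverse=True)[:k]
    # Stage 2: cross-encoder-style reranking of only those k candidates
    # (slower, more precise scoring applied to a small set).
    reranked = sorted(candidates, key=lambda d: slow_accurate_score(query, d),
                      reverse=True)
    return reranked[:n]  # top-n documents passed on to the generation model

corpus = [
    "dogs are loyal pets",
    "dogs and cats are common pets in many homes around the world",
    "quantum computing uses qubits",
]
print(retrieve_then_rerank("are dogs good pets", corpus, k=2, n=1))
```

The structural point is that the expensive scorer only ever sees `k` documents, never the whole corpus, which is what makes cross-encoder reranking affordable in production.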

Due to the cross-encoder model’s deep interactions between the query and candidate documents, it significantly improves the quality of retrieved information, leading to better-grounded, more accurate, and more contextually relevant outputs during generation.

Thus, reranking, especially using cross-encoders, is a vital tool in building high-precision, production-grade AI retrieval systems today, as shown in the following figure:

A flow diagram showing vector search results fed into a reranker (a cross-encoder), producing reranked results from which the top-K are selected.

Figure 1.10: Reranking architecture for improving document relevance

While reranking improves the quality and relevance of retrieved information, it does not inherently guarantee that the AI’s final output will always be safe, unbiased, or aligned with application requirements. Even with high-precision retrieval, generation models can still hallucinate, introduce sensitive content, or produce outputs that deviate from user expectations. To address these risks, modern AI systems implement guardrails, structured controls, and validation mechanisms designed to monitor, filter, and shape model behavior. In the next section, we will explore the concept of guardrails, why they are essential, and how they are applied across retrieval and generation pipelines.

Guardrails

As AI systems become increasingly capable, the need for guardrails, structured controls that guide and constrain model behavior, has become critical. Guardrails ensure that models act safely, ethically, and in alignment with application or organizational goals, even when handling complex, open-ended inputs.

While reranking helps in surfacing more relevant and factual information, it does not inherently prevent hallucinations, bias propagation, policy violations, or user manipulation. LLMs are powerful but non-deterministic; even with clean inputs, they can produce unsafe, offensive, or misleading outputs if left unchecked. Guardrails help maintain trust, safety, and compliance, all crucial factors when deploying AI systems in real-world environments, especially in the enterprise, healthcare, finance, and education sectors. The following figure illustrates the architecture of a RAG system enhanced with guardrails, reranking, and LLM-based response generation:

A flow diagram showing a query passing through embedding, vector database search, reranking, and LLM processing to generate results, with guardrails applied at both the input and output stages.

Figure 1.11: End-to-end view of a guardrail-enabled GenAI system

Types of guardrails

Guardrails typically operate at two major stages, which are described in the following list:

  • Input guardrails: They filter, rephrase, or block problematic user queries before they reach the model. Examples include:
    • Detecting and blocking malicious prompts (e.g., asking the model to generate harmful content).
    • Rewriting vague, ambiguous, or risky queries into safer forms.
    • Adding disclaimers or constraints to the prompt to set clear expectations for model behavior.
  • Output guardrails: Output guardrails analyze, validate, and modify model-generated outputs before they are delivered to the user. Examples include:
    • Removing toxic, biased, or unsafe content.
    • Ensuring outputs comply with organizational policies (e.g., no sharing of personal data).
    • Fact-checking answers against trusted knowledge bases before displaying them.
    • Enforcing tone, style, or content formatting requirements (e.g., professional, neutral).
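
The two stages can be sketched with simple keyword rules. Real systems use trained classifiers and moderation APIs; the blocked phrases and the PII-redaction pattern below are purely illustrative assumptions:

```python
import re

# Illustrative blocklist; real input guardrails use trained safety classifiers.
BLOCKED_TOPICS = {"make a weapon", "steal credentials"}

# Crude PII pattern; real output guardrails combine many detectors.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def input_guardrail(query: str):
    """Return (allowed, message): block or pass a query before it reaches the model."""
    lowered = query.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return False, "This request cannot be processed."
    return True, query

def output_guardrail(response: str) -> str:
    """Redact PII-like strings from model output before it is delivered."""
    return EMAIL_PATTERN.sub("[REDACTED EMAIL]", response)

ok, msg = input_guardrail("How do I steal credentials from a website?")
print(ok, msg)  # blocked at the input stage, before any model call

print(output_guardrail("Contact alice@example.com for details."))
```

Note the asymmetry: the input guardrail can refuse outright (the model is never called), whereas the output guardrail must transform or discard content the model has already produced.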

Methods of applying guardrails

Guardrails are implemented through a combination of techniques:

  • Prompt engineering: Structuring prompts carefully to guide safer behavior.
  • Moderation APIs: Running outputs through toxicity and safety detectors (e.g., OpenAI Moderation API).
  • Policy engines: Defining explicit allowlists and blocklists for topics, keywords, or behaviors.
  • Post-generation filtering: Analyzing and editing or discarding unsafe responses before sending them to users.
  • Rule-based enforcement: Using pre-set rules to catch specific violations.
  • Retrieval-augmented guardrails: Using RAG to cross-validate facts or check consistency before output.

Without guardrails

Without guardrails, AI systems are vulnerable to the following:

  • Jailbreaking: Users craft tricky prompts to force the model to bypass restrictions and generate forbidden outputs.
  • Hallucination: The model invents plausible but false information without validation.
  • Bias amplification: The model unintentionally reinforces stereotypes or unfair content.
  • Security risks: Revealing sensitive information, leaking system details, or enabling harmful actions.

These risks can cause reputational damage, compliance violations, user harm, and even legal consequences for organizations.

Industry examples of guardrail solutions

The following leading AI platforms have recognized the need for robust guardrails and built specialized frameworks:

  • NVIDIA NeMo guardrails:
    • Open-source toolkit focused on trustworthy conversational AI.
    • Supports input filtering, output moderation, conversation flow control, and grounded generation.
    • Allows developers to define rails declaratively using YAML files to enforce behavior across bots and assistants.
  • Azure AI Prompt Shields:
    • Microsoft's enterprise-grade solution, integrated with the Azure OpenAI Service.
    • Detects and blocks prompt injection attacks, offensive content, and jailbreak attempts.
    • Provides both proactive input screening and reactive output moderation, making it highly effective for regulated industries.
  • OpenAI Moderation API: One of the most widely used tools for automatic content safety checks is the OpenAI Moderation API. This API analyzes both user inputs and model outputs to detect sensitive content across categories such as:
    • Hate
    • Harassment
    • Sexual content
    • Violence
    • Self-harm
    • Misleading information

The Moderation API returns detailed scores indicating the likelihood of a violation, allowing developers to:

  • Automatically block unsafe queries or completions.
  • Flag outputs for human review.
  • Customize workflows based on severity thresholds.
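
The score-threshold routing described above can be sketched as follows. The per-category scores are hypothetical values of the kind a moderation endpoint returns, and the thresholds are illustrative policy choices, not recommended settings:

```python
def route_by_moderation(scores, block_threshold=0.8, review_threshold=0.4):
    """Route content based on per-category violation-likelihood scores (0 to 1).

    Thresholds are illustrative; real deployments tune them per category.
    """
    worst_category = max(scores, key=scores.get)
    worst = scores[worst_category]
    if worst >= block_threshold:
        return "block", worst_category        # auto-block unsafe content
    if worst >= review_threshold:
        return "human_review", worst_category  # flag for a human reviewer
    return "allow", worst_category

# Hypothetical moderation scores for one piece of content.
scores = {"hate": 0.02, "harassment": 0.55, "violence": 0.10, "self-harm": 0.01}
print(route_by_moderation(scores))  # ('human_review', 'harassment')
```

In practice, each category would get its own thresholds (self-harm is typically treated far more strictly than, say, mild profanity), but the three-way allow/review/block routing pattern stays the same.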

By integrating the Moderation API into production pipelines, developers ensure that models behave consistently with safety and compliance standards, without requiring constant manual monitoring.

These tools show that guardrails are no longer optional and are foundational to building responsible, production-ready AI applications.

While retrieval, reranking, and guardrails significantly enhance the reliability and safety of AI systems, true intelligent behavior requires models to go beyond single-turn responses. Modern AI applications increasingly involve agents. They are systems capable of autonomous reasoning, decision-making, planning, and tool use. Although we will explore agents in greater depth in Chapter 5, Implementing Agentic GenAI Systems with Human-AI Interaction, it is important to introduce the core concepts: how agents leverage tools, perform reasoning, develop plans, execute actions, maintain memory across tasks, and collaborate in multi-agent systems to solve complex goals. Understanding these foundational ideas will prepare us for building more dynamic, adaptable AI solutions in the chapters ahead.

Agents

A GenAI agent is an intelligent software system that uses generative models, such as LLMs or diffusion models, to understand, reason, and create content in response to user input or environmental stimuli. It can perform tasks like answering questions, generating text or images, summarizing content, or even collaborating in problem-solving. GenAI agents often integrate with tools or APIs and can operate autonomously or within a larger multi-agent system. They observe inputs, make decisions based on learned patterns, and take actions aligned with their goals, mimicking human-like cognition in creative and functional contexts. Refer to the following list to build a deeper understanding of agents:

  • Tools are external functions, APIs, or utilities that an agent can call upon to extend its capabilities beyond pure text generation. Instead of trying to answer all questions from its own knowledge, an agent can access search engines, databases, calculators, knowledge graphs, or custom APIs to gather real-time information or perform specific operations. Tools make agents more powerful, enabling them to interact with the external world, retrieve up-to-date facts, query private data, or execute tasks that a model alone could not accomplish reliably.
  • Reasoning refers to the agent’s ability to think through problems step-by-step, rather than immediately generating an output. Through reasoning, an agent evaluates the current situation, decides what it knows, identifies what it needs to find out, and chooses a sequence of actions that move it closer to solving the task. Reasoning allows agents to break down complex goals into manageable subproblems, handle uncertainty, and adapt their behavior dynamically based on new information or unexpected results.
  • Planning builds upon reasoning and involves the structured organization of future steps to achieve a given objective. A well-designed agent does not simply react one step at a time but is capable of constructing a flexible, goal-directed plan: a deliberate sequence of tool calls, decisions, and actions that lead to a successful outcome. Advanced agents enhance their planning through techniques like self-reflection, evaluating their intermediate steps and adjusting strategies if needed, and chain-of-thought (CoT) prompting, which encourages systematic, multi-step reasoning before execution. This enables agents not only to set intermediate goals and prioritize tasks but also to improvise dynamically when faced with unexpected results or incomplete information. Effective planning allows agents to navigate complex, multi-stage processes efficiently rather than acting impulsively or getting trapped by early errors. The ability to plan, reflect, reason through chains of logic, and adapt in real time is critical for solving real-world tasks that require sustained, coherent progress across multiple actions and decision points.
  • Action represents the execution phase where an agent implements a specific step in its plan, such as calling an external tool, making an API request, saving information to memory, or returning an answer to the user. Actions are the doing part of agentic systems, where reasoning and planning turn into observable operations. Importantly, actions can be dynamic based on the results of previous actions; the agent may revise its plan, reason again, and take different subsequent steps. This continuous loop of thinking and acting distinguishes agents from static models.
  • Memory and multi-agent collaboration are advanced concepts that further enhance an agent’s capabilities. Memory allows agents to retain information across different steps, sessions, or even tasks, providing continuity, personalization, and long-term learning. Instead of starting from scratch every time, an agent with memory can recall past interactions, intermediate results, and evolving user preferences. Multi-agent collaboration expands this even further: multiple specialized agents can work together, sharing tasks, delegating responsibilities, and communicating with each other to solve complex goals more efficiently than a single agent could. Systems with memory and collaboration capabilities begin to resemble coordinated, modular ecosystems of intelligent agents working towards shared objectives. The following figure illustrates how an agent interacts with its environment and takes action:
A flow diagram showing a query sent to the LLM, which generates a plan; the agent then uses tools in the environment, producing actions, results, and memories, with reasoning connecting actions to results.

Figure 1.12: Agents flow, how an agent interacts with environment and takes action
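The observe-reason-act loop in Figure 1.12 can be sketched in a few lines of Python. Everything here is a simplified stand-in, not a production agent framework: the tools are plain functions, the keyword-based planner stands in for the LLM's reasoning step, and memory is a simple list.

```python
def calculator(expression):
    """A tool: evaluate a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))

def lookup(term):
    """A tool: query a tiny in-memory knowledge base."""
    kb = {"MCP": "Model Context Protocol, an open standard by Anthropic"}
    return kb.get(term, "unknown")

TOOLS = {"calculator": calculator, "lookup": lookup}

def stub_llm_plan(query):
    """Stand-in for the reasoning step: choose a tool and its argument."""
    if any(ch.isdigit() for ch in query):
        return "calculator", query
    return "lookup", query

def run_agent(query, memory):
    tool_name, arg = stub_llm_plan(query)      # reason / plan
    result = TOOLS[tool_name](arg)             # act (tool call)
    memory.append((query, tool_name, result))  # remember
    return result

memory = []
print(run_agent("2+3*4", memory))  # prints 14
print(run_agent("MCP", memory))
```

A real agent would replace `stub_llm_plan` with a call to a generative model, but the loop structure — decide, execute, store, repeat — remains the same.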

Agentic RAG vs. non-agentic RAG

In a non-agentic RAG system, the process is linear and static: a user query is embedded, a retriever fetches the top-k documents, and the language model generates an answer using the retrieved context. Each step follows a fixed pipeline without dynamic decision-making. Non-agentic RAG excels in simple question-answering tasks where the initial retrieval is usually sufficient, but it struggles when retrieval results are noisy, ambiguous, or insufficient for complex reasoning.

In contrast, an agentic RAG system introduces dynamic control, reasoning, and adaptability. An agent first assesses the query, retrieves initial documents, and reasons about whether the information is sufficient. If not, the agent can reformulate the query, perform multiple retrievals, choose different tools (like search APIs or databases), reflect on intermediate results, and dynamically plan multiple steps to arrive at a better-grounded answer. Agentic RAG systems can iteratively retrieve, rerank, reason, and synthesize across multiple knowledge sources, adapting in real-time to solve complex, multi-hop, or ambiguous queries.
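The contrast can be made concrete with a toy sketch. Retrieval here is naive keyword overlap over a three-document corpus, and the sufficiency check and query reformulation are trivial stubs; in a real system, the LLM itself would make those decisions.

```python
CORPUS = [
    "CLIP aligns images and text in a shared embedding space",
    "Agentic RAG can reformulate queries and retrieve iteratively",
    "Vector databases support scalable real-time semantic retrieval",
]

def retrieve(query, k=1):
    """Naive retriever: rank documents by keyword overlap with the query."""
    words = set(query.lower().split())
    return sorted(CORPUS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def non_agentic_rag(query):
    """Linear pipeline: one retrieval pass, then generate from the context."""
    return " | ".join(retrieve(query))

def covered(context, query):
    """Reflection stub: are all query terms present in the context?"""
    ctx = " ".join(context).lower()
    return all(w in ctx for w in query.lower().split())

def agentic_rag(query, max_steps=3):
    """Agentic loop: retrieve, reflect, and reformulate until satisfied."""
    context = []
    for _ in range(max_steps):
        context += retrieve(query)
        if covered(context, query):
            break
        # Reformulation stub: keep only the terms not yet covered.
        ctx = " ".join(context).lower()
        query = " ".join(w for w in query.split() if w.lower() not in ctx) or query
    return " | ".join(dict.fromkeys(context))  # deduplicate, keep order
```

For a query whose terms span two documents, the non-agentic pipeline stops after one pass, while the agentic loop notices the gap, reformulates, and retrieves again.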

Thus, while non-agentic RAG is simple and fast for straightforward tasks, agentic RAG is critical for building truly intelligent, reliable systems that can handle uncertainty, incomplete data, or evolving information needs. Figure 1.13 illustrates a multi-agent system featuring an orchestration agent and two additional agents capable of performing joint tasks as well as individual tasks:

A flow diagram showing a query sent to the LLM, which creates plans for two agents. Each agent uses tools, performs actions, stores memories, and produces results, all coordinated by an orchestration agent.

Figure 1.13: Multi-agent systems with an orchestration agent

Model Context Protocols

Agentic systems empower AI models to reason, plan, and execute tasks autonomously by dynamically using tools, APIs, and external knowledge sources. However, without a standardized way to discover and interact with these tools, scaling becomes chaotic and fragile. This is where the MCP is essential. MCP provides a universal, language-agnostic interface for agents to seamlessly access tools, data, and prompts, ensuring secure, modular, and dynamic integration.

MCP is an open standard designed to simplify and standardize how AI models interact with external tools, data sources, and APIs. Introduced by Anthropic, MCP acts as a universal communication layer, much like a USB-C for AI, enabling AI assistants and agents to seamlessly retrieve structured information, invoke actions, or apply domain-specific prompts without custom integrations for every backend system.

At its core, MCP establishes a client-server architecture where servers expose three primitives: tools (functions that perform actions), resources (data like documents or APIs), and prompts (guidance for AI behavior). MCP uses lightweight, language-agnostic protocols like JSON-RPC over transports such as stdio or HTTP/SSE, making it easy to integrate across diverse environments.
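To make this concrete, the following sketch shows what a tool invocation might look like on the wire. The `tools/call` method name follows the published MCP schema, but the tool name and arguments below are hypothetical examples, not part of the standard.

```python
import json

# Sketch of a tool invocation as it might travel over an MCP transport.
request = {
    "jsonrpc": "2.0",          # MCP messages are JSON-RPC 2.0
    "id": 1,
    "method": "tools/call",    # invoke a tool exposed by the server
    "params": {
        "name": "search_documents",            # hypothetical tool
        "arguments": {"query": "Q3 revenue"},  # tool-specific input
    },
}
wire_message = json.dumps(request)
print(wire_message)
```

The server would respond with a JSON-RPC result containing the tool's output, which the client then feeds back into the model's context.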

By adopting MCP, developers can build scalable AI systems where new tools and data sources can be dynamically discovered and utilized without retraining models or hardcoding APIs. MCP also ensures modularity, security, and future-proofing, critical for sectors like healthcare, finance, and enterprise automation. As AI ecosystems grow increasingly complex, MCP provides a foundation for building interoperable, secure, and agile AI systems that can reason and act across multiple domains through a unified interface.

Together, agentic systems and MCP enable AI to operate more intelligently and reliably, adapting to real-world complexities without hardcoded dependencies, unlocking powerful applications across industries like healthcare, finance, and education. Figure 1.14 shows how MCP establishes a client-server architecture and interacts with external tools, data sources, and APIs:

A flow diagram showing an enterprise LLM feeding data to an MCP client, which connects to an MCP server. The server links to icons representing vision, local search, web search, databases, documents, operating systems, code, GitHub, and APIs.

Figure 1.14: MCP establishes a client-server architecture and interacts with external tools, data sources, and APIs

Conclusion

In this chapter, we laid the foundation for understanding how modern GenAI systems are designed and orchestrated. We began by differentiating between retrieval systems and generation systems, exploring how each plays a critical role in building intelligent AI solutions. We discussed the evolution from traditional keyword-based retrieval to dense vector search powered by embeddings, and how vector databases enable scalable, real-time semantic retrieval. Moving beyond basic retrieval, we introduced reranking techniques, particularly the use of cross-encoders, to refine and prioritize retrieved documents for greater relevance and precision. We then emphasized the importance of guardrails to ensure AI outputs are safe, ethical, and aligned with real-world usage standards. Finally, we introduced the emerging world of agentic AI systems, covering key concepts such as tool use, reasoning, planning, action, memory, and multi-agent collaboration.

In the next chapter, we explore the expanding frontier of multimodal systems, where AI applications are no longer limited to a single mode of input or output. The focus then shifts to multimodal GenAI architectures, where text, images, and structured data interact within unified frameworks. Readers will learn how AI systems transform text into images, interpret images into descriptions, combine inputs for new outputs, and even translate natural language into Structured Query Language (SQL). This sets the foundation for building rich, contextually aware AI experiences.

CHAPTER 2: Deep Dive into Multimodal Systems

Introduction

The first chapter introduced the foundations of modern generative AI (GenAI), covering retrieval systems, generation models, retrieval-augmented generation (RAG), orchestration, tokenization, vector databases, reranking, guardrails, agent systems, and Model Context Protocols (MCPs). These core components established the groundwork for building intelligent, text-driven generative systems.

Building on this foundation, this chapter explores the evolution of AI into multimodal domains, where text, images, and other data types are processed together. We begin by explaining cross-encoders and bi-encoders within the context of vision-language models (VLMs), followed by a discussion on multimodal vector embeddings and the design of multimodal vector databases.

The chapter further clarifies how VLMs differ from broader multimodal GenAI systems. Practical applications, including text-to-image generation, image-to-text captioning, text and image-to-image synthesis, and text-driven specification and image generation, are covered. Additionally, we explore how text-to-SQL query generation expands the potential of multimodal AI systems.

Through this chapter, we move from understanding the basic mechanisms of generative models to developing systems capable of sophisticated, cross-modal reasoning, positioning us for advanced applications in real-world environments.

Structure

In this chapter, we will learn about the following topics:

  • Understanding vision-language model
  • Implementation comparisons
  • Multimodal generative AI systems vs. VLMs
  • Vision-language models
  • Output-based classification of multimodal systems

Objectives

This chapter aims to equip readers with a comprehensive understanding of the key building blocks essential for designing and deploying modern GenAI systems. By mastering concepts such as retrieval and generation systems, vector databases, embedding techniques, advanced prompting strategies, agentic architectures, and multi-agent collaboration, readers will gain a strong foundation for building intelligent, scalable AI solutions. Additionally, the chapter introduces critical topics like local model deployment, graphics processing unit (GPU) infrastructure, speech processing, memory management in agents, and industry standards like MCPs. These foundational elements are crucial for advancing toward multimodal, reliable, and production-ready AI applications.

Understanding vision-language models

VLMs form the foundation of multimodal AI systems that bridge the gap between visual and textual understanding. Unlike traditional GenAI, which primarily processes text or images in isolation, VLMs are designed to jointly interpret, align, and generate across both modalities. As organizations increasingly look to create systems that see and talk, VLMs have become critical for applications such as visual question answering (VQA), captioning, cross-modal retrieval, and even text-driven image generation.

Building on the foundations laid in the previous chapter, where we discussed core generative and retrieval concepts, this section delves into the architecture, types, and capabilities of VLMs, highlighting how they extend the principles of tokenization, vector embeddings, retrieval, and generation across multiple data forms.

Categories of vision-language models

VLMs are powerful AI systems that integrate visual and textual understanding, enabling machines to process, interpret, and generate information across both modalities. As the field evolves, VLMs are increasingly specialized in serving diverse applications, from retrieving the right image for a search query to generating detailed image descriptions or even reasoning over documents. To better understand their capabilities and design, VLMs can be broadly categorized based on their core objectives: retrieval; captioning and QA; generative synthesis; multimodal reasoning; and instruction-tuning. Each category reflects a unique architectural focus and supports real-world applications across industries like e-commerce, accessibility, education, and design. Based on their primary tasks and architectural design goals, VLMs can be broadly classified as follows:

  • Retrieval-focused VLMs: These models are trained to align images and text into a shared embedding space, allowing efficient cross-modal retrieval. A key objective is to find the most relevant image given a text query (or vice versa):
    • CLIP (OpenAI): Trained on 400 million image-text pairs; it learns to embed images and their associated text descriptions closely in the latent space.
    • ALIGN (Google): Scales the retrieval paradigm to billions of noisy image-text pairs from the web.
    • DeCLIP: Improves over Contrastive Language–Image Pretraining (CLIP) by considering harder negatives and better contrastive losses.
    • Applications: Image search engines, content moderation, e-commerce product discovery.
  • VQA and captioning VLMs: These models are designed to answer questions about images or generate captions that describe images:
    • Learning Cross-Modality Encoder Representations from Transformers (LXMERT) (Facebook): LXMERT merges separate visual and language representations via a cross-modal transformer.
    • UNiversal Image-TExt Representation Learning (UNITER) (Microsoft): UNITER trains a unified representation for both vision and text, achieving strong VQA results.
    • VisualBERT: It incorporates visual region features early into a Bidirectional Encoder Representations from Transformers (BERT) text transformer, improving captioning and VQA.
    • Applications: Accessibility tools (e.g., image descriptions for the visually impaired), visual assistants, and content tagging.
  • Generative VLMs: These models generate new images or detailed text from inputs across modalities:
    • DALL·E 2 (OpenAI): Generates photorealistic images from text prompts.
    • Imagen (Google): State-of-the-art text-to-image generation model, focusing on ultra-high-quality images.
    • Parti (Google): Autoregressive model generating images by modeling sequences of visual tokens.
    • Applications: Creative design, game development, marketing visuals, virtual worlds.
  • Multimodal reasoning VLMs: These models go beyond retrieval or generation, performing logical reasoning across modalities, such as inferring relationships between images and text, or solving complex visual-text tasks:
    • Flamingo (DeepMind): Few-shot visual reasoning with flexible conditioning on images and text.
    • Kosmos-1 (Microsoft): Extends multimodal reasoning to documents, images, Optical Character Recognition (OCR) text, and even visual math problems.
    • Applications: Multimodal chatbots, document understanding, and education technologies.
  • Instruction-tuned VLMs: The following recent models are aligned using human-like instructions, improving their ability to generalize across new tasks without explicit retraining:
    • BLIP (Salesforce): Bootstrapping Language-Image Pre-training; it can both generate captions and retrieve images.
    • BLIP-2 (Salesforce): Connects a vision encoder with a frozen language model (e.g., OPT, Flan-T5) for zero-shot VQA and generation.
    • MiniGPT-4: A lightweight, open-source model that connects a frozen vision encoder to an open-source LLM through a single projection layer, enabling GPT-4-style multimodal dialogue.
    • Applications: Zero-shot captioning, task-oriented multimodal systems, robotic perception.

视觉语言模型的核心架构组件

Core architectural components of vision-language models

Despite task differences, most VLMs share common design principles, like the following:

  • Cross-encoders vs. bi-encoders:
    • Cross-encoders jointly process both image and text during inference, allowing richer interaction but slower retrieval. It takes both the image and text input together and processes them through a single, joint transformer model. The image is usually represented as a sequence of visual tokens (e.g., from a Vision Transformer (ViT) or a convolutional neural network (CNN) with spatial flattening), and the text is tokenized as usual.

      These two sequences are then concatenated and passed into a transformer that performs self-attention across the entire joint sequence, meaning:

      • Text tokens attend to image tokens.
      • Image tokens attend to text tokens.
      • Tokens attend to each other within their modality as well.

      This full cross-attention allows rich, fine-grained interactions between vision and language representations at every layer. For example, a word like dog can attend to specific image patches showing the dog, and vice versa.

      Figure 2.1 depicts a cross-encoder-based VLM architecture designed for tasks requiring deep joint understanding of images and their corresponding textual descriptions, such as matching product photos with detailed specifications. Unlike dual encoders that generate separate embeddings for images and text, this approach does not rely on embeddings but instead computes a direct relevance score by processing the input as a combined pair. Through merged attention or cross-attention mechanisms, the model captures fine-grained interactions across modalities. This setup is ideal for scenarios where alignment precision is critical, such as e-commerce product verification, VQA, and multimodal document understanding.

A diagram showing a product image, a text description, and product specifications fed into an encoder or cross-encoder, which outputs only a score (no embeddings) via a merged-attention/cross-attention model.

Figure 2.1: Cross-encoders jointly process both image and text

  • Bi-encoders encode image and text separately and compare their embeddings later, enabling faster, scalable retrieval (as used in CLIP).

    The image and text are encoded separately into their own embeddings using two independent encoders. Typically, a vision encoder (e.g., a ViT or CNN) processes the image and produces an image embedding vector, while a text encoder (e.g., a transformer-based language model like BERT) processes the text and produces a text embedding vector.

    Importantly, the image and text do not interact during encoding.

    There is no cross-attention between image and text features during inference.

    Once both embeddings are generated independently, they are compared after encoding, often by computing a similarity score such as:

    • Cosine similarity
    • Dot product
    • Euclidean distance

    The following figure illustrates that the closer the two embeddings are in the vector space, the more relevant they are considered to each other:

A diagram showing a computer, text, and images representing products. Text-image pairs are encoded by an image encoder and a text encoder, combined into embeddings, and sent to a stacked transformer model.

Figure 2.2: Bi-encoders encode image and text separately

  • Fusion mechanisms: Some models (like LXMERT, VisualBERT) use early or late fusion to combine visual and textual information during model training.

    Fusion mechanisms in VLMs refer to how information from different modalities, typically visual features from images and textual features from language, is combined to form a joint representation. Effective fusion is crucial for enabling models to reason across both vision and text inputs.

There are several types of fusion strategies. Early fusion combines image and text embeddings at the input stage, allowing the model to jointly learn cross-modal interactions from the beginning. Late fusion processes each modality separately and merges their outputs at a later stage, typically before final decision-making. Intermediate fusion (or cross-modal fusion) combines features after partial processing, allowing for more sophisticated interactions between modalities during the model's forward pass. Fusion mechanisms are often implemented using cross-attention layers, where features from one modality (e.g., image regions) attend to features from the other modality (e.g., text tokens). This is similar to how transformers use attention to relate different parts of a sequence, but here the attention operates across modalities. Cross-attention enables models to selectively focus on relevant parts of an image when processing text and vice versa.

Thus, while fusion mechanisms refer broadly to the combining of modalities, cross-attention is a specific technique often used within fusion strategies.
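The two encoder styles discussed above can be contrasted in a toy numerical sketch. The hand-made three-dimensional vectors stand in for real encoder outputs, and the cross-encoder is reduced to a token-overlap stub; real systems would use CLIP-style encoders and a joint transformer, respectively.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Bi-encoder: image and text are encoded independently; embeddings are
# compared only afterwards, so they can be precomputed and indexed.
image_emb = [0.9, 0.1, 0.0]            # stub vision-encoder output
text_embs = {
    "a dog": [0.8, 0.2, 0.1],          # stub text-encoder outputs
    "a car": [0.0, 0.1, 0.9],
}
best_caption = max(text_embs, key=lambda t: cosine(image_emb, text_embs[t]))
print(best_caption)  # prints: a dog

# Cross-encoder: the pair is scored jointly in one forward pass; there are
# no reusable embeddings, only a relevance score per (image, text) pair.
def cross_encoder_score(image_tokens, text_tokens):
    """Stub for a joint transformer that emits a single relevance score."""
    return len(set(image_tokens) & set(text_tokens))

print(cross_encoder_score(["dog", "grass"], ["a", "dog"]))  # prints: 1
```

The asymmetry in cost is visible even in the sketch: the bi-encoder's text embeddings can be computed once and reused for every image, while the cross-encoder must score each pair from scratch.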

VLMs represent a critical evolution in AI, merging the strengths of computer vision and natural language processing (NLP) into unified, powerful architectures. From retrieval and captioning to multimodal reasoning and instruction-tuning, VLMs are paving the way for the next generation of intelligent systems capable of interacting with the world through multiple senses. As we proceed deeper into multimodal GenAI systems in the next sections, the capabilities and limitations of VLMs provide a vital reference point, highlighting the opportunities and challenges of building truly versatile, human-like AI.

Challenges in vision-language models

Despite their growing success across tasks such as VQA, image captioning, and cross-modal retrieval, VLMs face several critical challenges that limit their broader applicability and real-world deployment. These challenges stem from both architectural limitations and practical constraints in data, performance, and generalization:

  • Data requirements and quality: VLMs require vast amounts of aligned image-text data to learn meaningful cross-modal representations. High-quality datasets like COCO, Visual Genome, or LAION provide a starting point, but they are often biased toward Western, internet-centric content. This restricts generalization to domain-specific or non-English environments. Moreover, creating curated, well-aligned image-text pairs at scale is expensive, and noisy captions can degrade model learning.
  • Modality imbalance: In many cases, the textual modality dominates model learning due to its semantic richness, leading to underutilization of the visual signal. This imbalance reduces the effectiveness of vision-text fusion and results in suboptimal performance on tasks requiring fine-grained visual understanding, such as object grounding or scene description.
  • Limited multimodal reasoning: While VLMs can align text and image features, their ability to perform logical reasoning across modalities remains weak. They struggle with tasks requiring temporal understanding, numerical reasoning, or multi-step inference, especially when information must be drawn jointly from visual elements and textual context.
  • Generalization across domains: Most VLMs are trained on web-scale and generic datasets and fail to generalize to specialized domains, such as medical imaging, scientific literature, or industrial settings. Fine-tuning helps, but it often requires large labeled datasets in the target domain, which may not be readily available.
  • Efficiency and latency: VLMs are computationally expensive to train and deploy. Architectures that use cross-attention between vision and text tokens (e.g., UNITER, LXMERT) scale poorly at inference time. This limits their feasibility in low-latency or resource-constrained environments, such as mobile devices or edge computing.
  • Lack of tool integration: Unlike modular GenAI systems, most VLMs are not designed to interface with external tools, like databases, APIs, or vector stores. This limits their utility in dynamic environments where contextual grounding or external memory access is required.

Overcoming these challenges requires innovations in data curation, model architecture, training efficiency, and integration with retrieval or orchestration systems, many of which are addressed in broader multimodal GenAI frameworks.

Multimodal GenAI system

Training VLMs is an extremely resource-intensive process. These models require millions or even billions of aligned image-text pairs to learn meaningful multimodal representations. Curating such massive datasets involves substantial effort, including data collection, cleaning, filtering for quality, and sometimes human labeling to ensure proper alignment. Beyond data, computational costs are also very high. VLMs typically use large architectures, such as ViT for images and transformer-based encoders for text. Training them from scratch demands extensive GPU or TPU clusters running for weeks or even months. For example, models like CLIP (OpenAI) and ALIGN (Google) were trained on datasets that regular organizations cannot easily replicate due to hardware, storage, and energy costs. Moreover, achieving good generalization requires diverse and broad datasets, covering a wide range of visual and textual concepts, further increasing data acquisition challenges. Fine-tuning a pretrained VLM is more feasible for most organizations, but even that can be expensive if large-scale domain adaptation is needed.

Therefore, while developing a VLM from scratch offers full control and potential innovation, it is often prohibitively expensive. Many practical systems today rely on fine-tuning or adapting open-source pretrained VLMs instead of training entirely new models. An alternative to training large VLMs from scratch is building multimodal RAG systems. In multimodal RAG, as shown in Figure 2.3, separate retrievers fetch relevant text, image, or mixed-modal data from external sources, and a generator synthesizes a response based on the retrieved information. This approach bypasses the need for massive pretraining by leveraging existing multimodal embeddings and vector databases. It allows flexible integration of text, images, or both as context for downstream tasks like QA, captioning, or summarization, making it a more efficient and scalable method for deploying multimodal AI systems without the heavy costs of end-to-end training.

A flow diagram of a multimodal search system: the query is embedded; documents and images are chunked and embedded; the embeddings are stored in a vector database and searched; and the results are passed to an LLM to generate the response.

Figure 2.3: Multimodal RAG system, using two embedding models, one for text and one for image

Let us understand the multimodal RAG system, an efficient way to build multimodal AI capabilities without training large VLMs from scratch. This system intelligently retrieves and generates answers by leveraging both text and image data. The following is a detailed step-by-step explanation of the process:

  • User query submission: A user initiates the process by submitting a query.

    The query can be:

    • Text-only (e.g., find smartphones with a good camera),
    • Image-only (e.g., uploading a smartphone photo),
    • A combination of text and image (e.g., a photo with the text prompt, show similar models).

    The system must handle different modalities and interpret them appropriately.

  • Embedding generation: The query, whether text, image, or both, is processed through specialized embedding models, like the following:
    • Text embedding model: Converts the text input into a dense vector that captures its semantic meaning.
    • Image embedding model: Converts the image into a similar high-dimensional vector space representation.

    By encoding both modalities into a shared embedding space, the system ensures that similar concepts from text and images can be compared meaningfully.
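
To make the shared embedding space concrete, the following toy sketch maps a text query and an image (represented here by concept tags assumed to come from an upstream tagging step) into the same vector space. The fixed vocabulary and bag-of-concepts encoder are illustrative stand-ins for a real neural encoder such as CLIP:

```python
import math

# Toy shared embedding space. A real system would use neural encoders
# (e.g., CLIP) producing 512+ dimensional vectors; here a fixed concept
# vocabulary stands in so both modalities land in the same space.
VOCAB = ["red", "sports", "car", "wheel", "blue", "laptop", "port"]

def toy_embed(concepts):
    """Map a list of concept tokens to an L2-normalized vector."""
    vec = [0.0] * len(VOCAB)
    for c in concepts:
        vec[VOCAB.index(c)] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Vectors are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

text_query = toy_embed(["red", "sports", "car"])    # from the text query
image_tags = toy_embed(["red", "car", "wheel"])     # tags assumed from an image
unrelated = toy_embed(["blue", "laptop", "port"])   # a dissimilar item

# The semantically related pair scores higher than the unrelated one.
assert cosine(text_query, image_tags) > cosine(text_query, unrelated)
```

Because both modalities share one vector space, a single similarity function can compare a text query against image-derived vectors, which is exactly what the retrieval step relies on.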

  • Preprocessing of knowledge base: In preparation for retrieval:
    • Documents are broken into smaller, meaningful parts called chunks (e.g., paragraphs, sections).
    • Images are also prepared and associated with their corresponding textual metadata if available.
    • Both text chunks and images are embedded using the same modality-specific embedding models that are applied to the query (one for text, one for images). However, please take note that a single shared embedding model can also be used in specific contexts, and we will talk about this in the following chapters.
    • The resulting embeddings are stored inside a multimodal vector database.

    This step ensures that all knowledge assets—text and images—are searchable through vector similarity.

  • Vector database search: Once the query is embedded:
    • The system performs a vector search against the multimodal vector database.
    • It finds the most semantically similar documents, paragraphs, or images to the user's query.
    • The retrieval happens in embedding space, which allows for flexible matching beyond exact keyword or pixel similarity.

    This search step ensures that the most relevant knowledge pieces, irrespective of modality, are fetched.

  • Retrieved results consolidation:
    • The vector search results include text chunks, images, or both.
    • These results from the retrieved context will assist the final response generation.
    • Retrieved content provides factual grounding for the generative model, improving its accuracy and relevance.
  • Response generation by LLM:
    • The retrieved multimodal content is then fed into an LLM.
    • The LLM synthesizes the final answer or output using the retrieved knowledge as context.
    • It may combine text explanations, describe images, or generate creative outputs depending on the query.

    This design ensures that the model does not hallucinate answers but grounds them in actual retrieved knowledge.

  • Returning the result:
    • The final result is returned to the user.
    • The output could include text-based answers, references to images, or multimodal summaries based on the retrieved information.

    Thus, users receive high-quality responses generated efficiently through retrieval and augmentation.
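
The retrieval steps above can be sketched end to end with a tiny in-memory store. Everything here is an illustrative stand-in: the bag-of-words embedder replaces CLIP-style encoders, the list replaces a vector database, the image entry stores a caption in place of pixels, and the assembled prompt marks where a real LLM call would go:

```python
import math
from collections import Counter

def embed(text, vocab):
    """Bag-of-words stand-in for a real embedding model."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Toy knowledge base; "type" records the modality of each record.
corpus = [
    {"id": 1, "type": "text", "content": "phone with excellent low light camera"},
    {"id": 2, "type": "image", "content": "photo of phone camera module"},
    {"id": 3, "type": "text", "content": "laptop with many usb ports"},
]
vocab = sorted({w for r in corpus for w in r["content"].split()})
index = [(r, embed(r["content"], vocab)) for r in corpus]

def retrieve(query, k=2):
    """Embed the query and return the k most similar records."""
    q = embed(query, vocab)
    scored = sorted(index, key=lambda rv: -sum(a * b for a, b in zip(q, rv[1])))
    return [r for r, _ in scored[:k]]

hits = retrieve("find smartphones with a good camera")
# Retrieved content grounds the generator; an LLM call would consume this.
prompt = "Context:\n" + "\n".join(h["content"] for h in hits) + "\n\nAnswer the user query."
```

The design point the sketch makes is that retrieval and generation are separate stages: the vector search narrows the knowledge base to a few grounded snippets, and only those reach the generator.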

This multimodal RAG architecture efficiently merges text and image retrieval with generative capabilities. It bypasses the need for massive VLM pretraining, reduces computational costs, and enables scalable deployment of multimodal systems. By separating retrieval and generation, organizations can build powerful AI solutions with existing embedding models and LLMs, making it an attractive option for real-world multimodal AI applications.

Multimodal vector embedding

Now you know that in the era of GenAI, the ability to work across multiple modalities (text, images, audio, video, and structured data) is no longer a luxury but a necessity. Multimodal RAG systems are at the forefront of this evolution, enabling more context-rich, informative, and human-like responses by augmenting LLMs with relevant information retrieved from diverse data sources. However, the effectiveness of such systems depends heavily on their underlying vector representations, specifically the ability to generate multimodal vector embeddings that unify information across formats in a comparable, semantically rich space.

Multimodal vector embeddings are essential because they form the backbone of similarity search in a RAG pipeline. A standard text-only RAG system may suffice for applications limited to documents, webpages, or textual knowledge bases. However, real-world information is often multimodal. For example, user manuals contain both text and diagrams; product specifications include tabular data and annotated images; customer support interactions may involve voice transcripts and screenshots. A system that cannot simultaneously understand and retrieve relevant information from these heterogeneous formats will miss critical signals, leading to suboptimal generation quality.

To enable cross-modal retrieval, each piece of content, whether it is an image, paragraph, or audio clip, must be embedded into a vector space. However, unlike unimodal systems, where all embeddings are derived from the same encoder and live in a uniform latent space, multimodal systems require a more sophisticated design. As explained in Figure 2.2, separate encoders (e.g., CLIP for images, Sentence Transformers for text, and Whisper for audio) are often used to generate modality-specific embeddings. These embeddings must then either be mapped into a shared latent space or linked via indexing strategies that allow for efficient similarity computation across modalities.

For example, consider a user asking, show me laptops with ports like this, while uploading an image of a laptop side profile. A unimodal RAG system would fail to interpret the image. In contrast, a multimodal RAG system with joint vector embeddings can match the image to similar laptop port diagrams stored in the database and retrieve corresponding product specifications and reviews. This retrieval is only possible because the visual and textual information are both represented as vectors in a shared or aligned space that preserves semantic meaning.

Multimodal vector embeddings also enhance the flexibility of query formulation. Users can input images, text, or even a combination of both, and the system can match them against relevant documents, diagrams, or knowledge chunks. This makes the system more intuitive and inclusive, bridging language barriers and accommodating users who may not have the precise keywords but possess visual or auditory cues.

Furthermore, in RAG systems designed for high-stakes domains like healthcare, legal, or manufacturing, the use of multimodal embeddings ensures a more comprehensive evidence base for answer generation. It reduces the risk of hallucinations by anchoring the generation to real, multimodal data artifacts rather than relying purely on prior model knowledge.

Multimodal vector database

Once multimodal vector embeddings are generated, representing text, images, or both in a shared semantic space, they must be efficiently stored and retrieved to support real-time AI applications. This is where a multimodal vector database becomes essential. It provides a structured, high-performance storage system optimized for similarity search across embeddings from different modalities. By organizing these embeddings alongside metadata (e.g., language, timestamp), the vector database enables fast, filtered approximate nearest neighbor (ANN) retrieval. This transition from embeddings to a vector database is crucial for powering scalable, cross-modal systems such as multimodal RAG, recommendation engines, and semantic search platforms.

Examples:

  • Qdrant, Weaviate, Pinecone, Milvus, and Chroma can all be used as multimodal vector databases, provided:
    • You normalize embeddings from different modalities into the same dimension (often necessary).
    • You use appropriate metadata tags (e.g., "type": "text" or "type": "image") to control retrieval behavior.

You must weigh some critical design choices when using a multimodal vector database. Let us take Qdrant as an example of a vector database for storing and retrieving high-dimensional multimodal embeddings.

Let us understand a few key concepts specifically in the context of Qdrant, a popular vector database. While most vector databases operate on similar principles, detailing each one individually is beyond the scope of this chapter and book.

Collections

A collection is the fundamental organizational unit in Qdrant. It is essentially a labeled group of data points that share a common structure. Each point in a collection is associated with a vector of a fixed size and is compared using a specific similarity metric (e.g., cosine, dot product, Euclidean). All vectors in the same collection must adhere to this uniform dimensionality and distance function. Qdrant also allows multiple vectors to be stored under different names within a single point (known as named vectors); each named vector can follow its own metric and dimension settings.

Points and point IDs

In Qdrant, a point is an individual entry within a collection. It comprises the following:

  • A unique identifier (point ID).
  • One or more vector embeddings.
  • Optional metadata known as payload.

These points are the basic units that users search against using vector similarity. The point ID is used to retrieve, update, or delete specific records. All point-related operations, including insertions or updates, are first logged to ensure durability and recovery, even in the event of power failure.

Vectors

Vectors (also known as embeddings) represent the encoded numerical form of various data types, such as images, text, or audio. These vectors enable the comparison of different data objects in high-dimensional space. The closer two vectors are in this space, the more similar their original objects are considered to be. To generate these embeddings, one typically uses a neural network trained to learn meaningful patterns, often based on contrastive learning from labeled or weakly labeled data. Vectors are the cornerstone of similarity search and are used in clustering, ranking, and retrieval tasks.

Payload

The payload refers to additional metadata stored alongside each vector. This metadata is flexible and can take any JSON-compatible structure. It can describe attributes like language, timestamp, user information, category, or any domain-specific tags. Payloads allow Qdrant to perform filtered searches, letting users restrict similarity searches to vectors with certain metadata properties. For example, retrieving only English-language documents or filtering by date.

Storage and vector store

Qdrant organizes its data into segments within each collection. Each segment maintains its own set of vectors, payloads, and indexes. Segments are optimized for different use cases, like the following:

  • Appendable segments support fast inserts, updates, and deletions.
  • Non-appendable segments are optimized for static or read-heavy data.

Qdrant supports two storage models, which are as follows:

  • In-memory storage: Keeps all vector data in RAM for maximum performance, with disk used only for persistence.
  • Memory-mapped storage: Links disk files to virtual memory, offering a balance between speed and memory usage by leveraging the operating system’s page cache.

This architecture ensures that performance and cost can be tuned based on application requirements.

Indexing

In a real-world multimodal GenAI system, efficient data management and retrieval are essential for delivering fast, accurate responses across image and text modalities. Qdrant, a high-performance vector database, enables this by combining vector indexing and payload filtering, ensuring both semantic similarity and structured metadata constraints are handled seamlessly. By leveraging collections, point-level metadata (payloads), and high-dimensional embeddings from models like CLIP or BLIP, Qdrant facilitates hybrid search—retrieving relevant items based on meaning and filters like product category or color. These indexing strategies, including hierarchical navigable small world (HNSW) and payload indexes, ensure GenAI applications scale reliably while maintaining low-latency performance. Qdrant supports both vector indexing and payload (filter) indexing, allowing efficient hybrid search:

  • Vector indexes (e.g., HNSW) accelerate similarity searches by organizing vectors into graph structures that reduce the search space.
  • Payload indexes function similarly to indexes in traditional databases. They allow for fast filtering based on metadata (e.g., language or category fields).

While indexing improves speed and accuracy, it incurs additional memory and processing costs. Users can selectively configure which fields should be indexed based on their expected query patterns and cardinality. Index parameters are defined at the collection level, but the actual index presence in segments depends on optimization rules and data distribution.

Let us integrate all the Qdrant concepts you learned about directly into the context of building a Multimodal GenAI system.

In a practical multimodal GenAI system, managing and retrieving data across modalities, like text and images, is not just about embedding vectors; it is about organizing, filtering, and retrieving them efficiently at scale. This is where concepts like collections, points, vectors, payloads, and indexes, as implemented in vector databases such as Qdrant, become critically important.

At the core of such a system is the vector embedding process. For each input data type, such as a product image or its description, a neural network model (e.g., CLIP or BLIP) converts the input into a high-dimensional vector. These vectors capture semantic meaning, so a caption like a red sports car and an image of a red sports car will generate embeddings that lie close to each other in the vector space. These embeddings are then grouped into collections, each representing a logical dataset segment. For example, a single collection may store all vectors related to retail product data, with images and text stored as named vectors under each point.

Each point within this collection represents an individual item, say, a product instance, and is assigned a unique point ID. Alongside the vector(s), a point can include a payload, which stores useful metadata such as language, timestamp, product category, or even the original file source. In a multimodal GenAI setup, this payload becomes crucial when we want to filter results by modality, time range, or other criteria during retrieval.

When a user inputs a query, perhaps a product photo with a textual request like show similar models available in blue, the system needs to perform a hybrid search. This means retrieving results not only based on vector similarity but also using constraints defined in the payload (e.g., color = "blue"). To enable this, Qdrant supports payload indexing, which allows fast filtering across structured metadata fields, much like indexes in traditional relational databases.

Behind the scenes, the collection is divided into segments, each with its own storage and indexing configuration. Depending on performance requirements, these segments may use in-memory storage for maximum speed or memory-mapped storage to optimize RAM usage while still enabling fast access via the OS-level page cache. For a real-world GenAI application that serves millions of users, segmenting storage this way ensures both scalability and fault tolerance.

Finally, to accelerate vector retrieval, Qdrant supports high-performance vector indexes (e.g., HNSW) that allow the system to quickly approximate the nearest neighbors in high-dimensional space without brute-force comparison. Combined with payload filters, this indexing strategy enables ANN retrieval with precise control, which is vital for real-time multimodal systems.

Implementation comparisons

Two common strategies emerge when storing and searching high-dimensional multimodal embeddings: using a single collection with filters vs. creating multiple collections with localized indexing. These design decisions are especially important when embeddings are updated frequently, such as every day in a production pipeline, and when queries require fine-grained control.

Single collection, partitioned via payload

In this approach, all vector embeddings, whether derived from images, text, or multimodal documents, are stored in a single, unified collection. The differentiation between data types, such as dates or languages, is handled via payload metadata. For example, each point might carry tags like {"date": "2025-05-09", "language": "en"}, which are then used as filters during query execution.

This setup is simple and scalable. There is only one collection to maintain, and all embeddings are searchable in a single vector space. Operationally, it is cost-efficient and easy to integrate with downstream systems. However, because no global index is built across subsets (e.g., by date or language), the ANN retrieval accuracy is significantly lower, dropping to around a 50% match rate compared to exact KNN searches.

Filtering embeddings purely through payload without global indexing introduces inefficiencies, especially when the dataset grows or becomes skewed across time and classes. For example, if one date contains disproportionately more data or a specific language dominates, ANN search may lose precision due to uneven vector distribution across the filtered subsets. The following figure depicts a single collection, partitioned via payload:

A flow diagram showing text and image embedding models processing documents and images into vector representations, which are stored in a vector database as points carrying text and image payloads.

Figure 2.4: A single collection, partitioned via payload

Use case: This method is best for environments where ease of maintenance and cost control are prioritized over perfect retrieval accuracy, such as in non-critical retrieval tasks or early-stage prototypes.

Multiple collections with global indexing

The second strategy opts for separating collections by date, creating one collection per day (e.g., embeddings_2025_05_08, embeddings_2025_05_09). Within each collection, a global vector index (e.g., HNSW) is explicitly built to enable highly optimized ANN retrieval. Each collection can then be partitioned further by language using payload filters.
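
A minimal sketch of the per-date routing this strategy implies; the dictionary of lists stands in for real Qdrant collections, and vector scoring is omitted to keep the focus on routing and payload filtering:

```python
from datetime import date

def collection_for(day: date) -> str:
    """Per-date collection naming, e.g., embeddings_2025_05_09."""
    return f"embeddings_{day:%Y_%m_%d}"

def search(stores: dict, day: date, language: str):
    """Route the query to a single day's collection, then filter by payload.
    A dict of lists stands in for Qdrant collections; in a real system this
    would be a vector search within the selected collection."""
    return [p for p in stores.get(collection_for(day), []) if p["language"] == language]

stores = {
    "embeddings_2025_05_08": [{"id": 1, "language": "en"}],
    "embeddings_2025_05_09": [{"id": 2, "language": "en"}, {"id": 3, "language": "fr"}],
}
hits = search(stores, date(2025, 5, 9), language="en")
```

Because the search touches exactly one day's collection, each collection's index covers a small, homogeneous slice of the data, which is what drives the higher ANN precision described here.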

This approach results in significantly higher precision during ANN-based searches—up to 98% match rate compared to exact KNN—because each collection benefits from localized indexing and a more homogeneous embedding distribution. By narrowing the search space to a single date and filtering only within that segment, the system avoids the dilution of vector clusters that occurs in large, global collections.

However, this model comes at a cost. Maintaining multiple collections increases operational complexity, and the system must manage the indexing cost for each new collection. Additionally, scaling to many collections over time (e.g., per hour or per user) may lead to resource inefficiency and storage overhead.

Use case: This model is ideal when high-accuracy recommendations or precise semantic search are required, such as in product recommendation engines, personalized assistants, or critical analytics pipelines.

The following figure depicts multiple collections with a global indexing approach:

A diagram showing a text embedding model processing documents into text chunks stored in a text collection, and an image embedding model processing images into an image collection, both kept in a vector database.

Figure 2.5: Multiple collections with global indexing

Having explored how multimodal vector embeddings are organized, stored, and retrieved using vector databases, we now shift our focus to how these capabilities are applied in end-to-end AI systems. While vector databases serve as the backbone for efficient storage and search across modalities, the architectural choices made on top of this infrastructure, particularly whether to use a multimodal GenAI system or a VLM, can significantly influence performance, scalability, and application fit. In the following section, we examine the fundamental differences between these two approaches and look at when each should be used.

Multimodal generative AI systems vs. VLMs

As AI evolves, the demand for systems that can understand and generate across multiple data modalities, such as text, images, audio, or structured data, has grown significantly. Two approaches have emerged at the forefront of this advancement: VLMs and broader multimodal GenAI systems. While the terms are sometimes used interchangeably, they serve distinct purposes and operate under different architectural principles. This section clarifies their differences and offers guidance on when each is best applied.

Vision-language models

VLMs are a subset of multimodal AI systems that specifically integrate visual and textual modalities. These models are trained to understand and align image features with language features, enabling tasks such as image captioning, VQA, image-text retrieval, and cross-modal reasoning.

VLMs are typically built using architectures that fuse embeddings from two separate neural encoders: a vision encoder (e.g., ViT, ResNet) and a text encoder (e.g., BERT, RoBERTa, or GPT). The fused representation allows the model to reason across both modalities. Some models use cross-attention mechanisms to allow image tokens to attend to text tokens, and vice versa, while others use contrastive learning (e.g., CLIP, ALIGN) to map images and texts into a shared latent space for retrieval.

The following are examples of VLMs:

  • CLIP
  • BLIP and BLIP-2
  • UNITER, LXMERT, VisualBERT
  • Flamingo and Kosmos-1 for few-shot multimodal reasoning

VLMs are often pretrained on large image-text datasets, using self-supervised or semi-supervised learning objectives, and can be fine-tuned on downstream tasks requiring vision-language alignment.

多模态生成式人工智能系统

Multimodal generative AI systems

相比之下,多模态GenAI系统旨在跨多种模态运行,这些模态通常是任意的,而不仅限于视觉和语言。这些系统结合了检索、推理和生成等组件,通常通过模块化架构进行协调。

In contrast, multimodal GenAI systems are designed to operate across multiple and often arbitrary modalities, not limited to vision and language. These systems combine components for retrieval, reasoning, and generation, often orchestrated through modular architectures.

一个关键区别在于多模态 GenAI 系统中常用的检索增强架构。这些系统并非仅仅依赖单个预训练模型,而是:

A key difference is the retrieval-augmented architecture often used in multimodal GenAI systems. Instead of relying solely on a single pretrained model, these systems:

  • 从外部存储中检索相关的文本、图像或结构化数据。
  • Retrieve relevant text, image, or structured data from external stores.
  • 对这些信息进行编码和融合。
  • Encode and fuse that information.
  • 使用语言模型、图像生成器或其他特定模态生成器生成输出。
  • Generate outputs using a language model, image generator, or other modality-specific generator.
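The three stages above can be sketched as a deliberately minimal, self-contained Python pipeline. The character-count embedder, the in-memory vector store, and the stub generator are toy stand-ins for real embedding models, vector databases, and LLMs:

```python
import numpy as np

def embed(text):
    # Hypothetical toy embedder: normalized bag-of-characters counts.
    # A real system would use a trained embedding model instead.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    n = np.linalg.norm(vec)
    return vec / n if n else vec

class VectorStore:
    """Minimal in-memory store standing in for Qdrant, Faiss, etc."""
    def __init__(self, docs):
        self.docs = docs
        self.matrix = np.stack([embed(d) for d in docs])

    def retrieve(self, query, k=1):
        # Cosine similarity between the query and every stored document.
        scores = self.matrix @ embed(query)
        return [self.docs[i] for i in np.argsort(-scores)[:k]]

def generate(query, context):
    # Stub generator; a real pipeline would prompt an LLM or image model.
    return f"Answer to '{query}' grounded in: {context[0]}"

store = VectorStore([
    "The green switch activates power-saving mode.",
    "The red dial controls the fan speed.",
])
context = store.retrieve("How do I enable power saving?")
print(generate("How do I enable power saving?", context))
```

Each stage is replaceable in isolation, which is precisely the modularity that distinguishes these pipelines from a single end-to-end model.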

多模态GenAI系统可以将视觉语言模型(VLM)作为子组件集成,但并不局限于此。它们通常支持以下输入到输出的组合:

Multimodal GenAI systems can incorporate VLMs as subcomponents but are not limited to them. They often support the following:

  • 图片 + 文本 → 结构化报告
  • Image + text → structured report
  • 音频 + 文本 → 摘要
  • Audio + text → summary
  • 文本 → 图片 + 标题
  • Text → image + caption
  • 图片 + 文本 → SQL 查询
  • Image + text → SQL query
  • PDF + 图表 → 自然语言解释
  • PDF + diagram → natural language explanation

这些系统通常基于流水线,结合不同的模型和检索层来执行任务。检索增强生成(RAG)、编排层(例如 LangChain 或 LangGraph)以及工具的使用(通过智能体)都很常见。

These systems are typically pipeline-based, combining different models and retrieval layers to perform a task. RAG, orchestration layers (like LangChain or LangGraph), and tool use (via agents) are common.

让我们快速了解一下架构上的差异:

Let us look at the architectural differences at a glance:

特征

Feature

VLM

VLMs

多模态GenAI系统

Multimodal GenAI systems

支持的模式

Modalities supported

仅视觉 + 文本

Vision + text only

任何形式(文本、图像、音频、视频、表格)

Any modality (text, image, audio, video, tables)

模型结构

Model structure

端到端统一 Transformer

End-to-end unified transformer

模块化流水线,带有独立的检索器和生成器

Modular pipeline with separate retrievers and generators

典型用例

Typical use case

图像描述、视觉问答(VQA)、检索

Captioning, VQA, retrieval

多模态聊天、文档分析、RAG、复杂工作流程

Multimodal chat, document analysis, RAG, complex workflows

数据来源

Data sources

基于图像-文本对进行预训练

Pretrained on image-text pairs

与数据库、API、工具和内存集成

Integrates with DBs, APIs, tools, and memory

检索层

Retrieval layer

并非总是存在

Not always present

架构的组成部分

Integral part of architecture

灵活性和定制性

Flexibility and customization

中等

Moderate

高

High

使用代理或编排

Use of agents or orchestration

少见

Rare

通用(LangChain、LlamaIndex 等)

Common (LangChain, LlamaIndex, etc.)

可扩展性

Scalability

受模型规模限制

Limited by model size

可扩展性强,支持检索和模块化

Scalable with retrieval and modularity

表 2.1:VLM 与多模态 GenAI 系统的比较

Table 2.1: Comparison of VLMs vs. multimodal GenAI systems

使用视觉语言模型

Using vision-language models

在以下情况下应使用 VLM:

You should use VLMs in the following cases:

  • 您的任务与视觉和文本对齐密切相关,例如:
    • 根据文本提示进行图像分类
    • 字幕生成
    • VQA
    • 图像-文本或文本-图像检索
  • Your task is tightly coupled with visual and textual alignment, such as:
    • Image classification with text prompts
    • Caption generation
    • VQA
    • Image-text or text-image retrieval
  • 你需要端到端的性能,低延迟推理,并且不需要外部检索逻辑。
  • You need end-to-end performance with low-latency inference and no external retrieval logic.
  • 您目前使用的模型有限,并且更喜欢单模型设置而不是流水线编排。
  • You are working with limited modalities and prefer a single-model setup over pipeline orchestration.
  • 你需要使用在大规模数据集上预训练的跨模态理解模型,并且可以针对特定任务进行微调。
  • You require models that have been pretrained on large-scale datasets for cross-modal understanding and can be fine-tuned for specialized tasks.
  • 您的环境更倾向于模型推理,而不是动态组合,例如移动应用程序或边缘设备。
  • Your environment favors model inference over dynamic composition, such as mobile applications or edge devices.

简而言之,VLM 非常适合受控的视觉语言任务,这些任务可以从单个模型中的深度跨模态表征学习中受益。

In short, VLMs are ideal for controlled, vision-language tasks that benefit from deep cross-modal representation learning in a single model.

使用多模态生成式人工智能系统

Using multimodal generative AI systems

在以下情况下使用多模态GenAI系统:

Use multimodal GenAI systems when:

  • 除了视觉和文本之外,你还需要结合多种其他模态,例如:
    • 文本+图像+表格(例如,文档解析)
    • 音频+文本(例如,会议摘要)
    • PDF + 图表 + 问题(例如,科学论文分析)
  • You need to combine multiple modalities beyond vision and text, such as:
    • Text + image + tables (e.g., document parsing)
    • Audio + text (e.g., meeting summarization)
    • PDF + chart + question (e.g., scientific paper analysis)
  • 你需要检索增强推理:
    • 搜索矢量数据库
    • 检索相关文档或图像
    • 使用外部电源接地输出
  • You require retrieval-augmented reasoning:
    • Search a vector database
    • Retrieve related documents or images
    • Ground the output using external sources
  • 您的任务需要使用工具、决策逻辑或基于代理的编排:
    • 数据库查询(SQL 生成)
    • 调用 API
    • 多步骤推理(例如,问题分解)
  • Your task demands tool use, decision logic, or agent-based orchestration:
    • Querying databases (SQL generation)
    • Invoking APIs
    • Multi-step reasoning (e.g., question decomposition)
  • 您需要的是一种灵活且模块化的架构:
    • 可更换的取回器
    • 自定义嵌入模型
    • 可替换的大型语言模型(OpenAI、Ollama、Claude 等)
  • You want a flexible and modular architecture:
    • Replaceable retrievers
    • Custom embedding models
    • Swappable large language models (OpenAI, Ollama, Claude, etc.)
  • 你需要一个定期更新的动态知识库:
    • 嵌入更新
    • 多语言检索
    • 使用元数据进行过滤搜索(例如,通过 Qdrant 或 Weaviate)
  • You need a dynamic knowledge base that updates regularly:
    • Embedding updates
    • Multilingual retrieval
    • Filtered search using metadata (e.g., via Qdrant or Weaviate)
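The metadata-filtered search mentioned above can be sketched in a few lines of Python. The toy store and its two-phase filter-then-rank logic only illustrate the shape of what engines like Qdrant or Weaviate provide; the documents, vectors, and field names are invented:

```python
import numpy as np

# Hypothetical document store with per-document metadata payloads.
documents = [
    {"text": "2024 warranty terms", "lang": "en", "vec": [0.9, 0.1]},
    {"text": "2019 warranty terms", "lang": "en", "vec": [0.8, 0.2]},
    {"text": "Garantiebedingungen 2024", "lang": "de", "vec": [0.85, 0.15]},
]

def filtered_search(query_vec, metadata_filter, k=1):
    # Apply the metadata predicate first, then rank the survivors by
    # cosine similarity -- the same two-phase shape real engines use.
    candidates = [d for d in documents if metadata_filter(d)]
    q = np.array(query_vec)
    q = q / np.linalg.norm(q)
    def score(d):
        v = np.array(d["vec"])
        return float(v @ q / np.linalg.norm(v))
    return sorted(candidates, key=score, reverse=True)[:k]

hits = filtered_search([1.0, 0.0], lambda d: d["lang"] == "en")
print(hits[0]["text"])
```

Production engines evaluate such filters against indexed payloads rather than scanning every document, but the contract is the same: only documents that satisfy the metadata predicate participate in the similarity ranking.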

从本质上讲,多模态 GenAI 适用于需要高度适应性、实时数据集成和 AI 组件复杂编排的企业级、多用途应用。

In essence, multimodal GenAI is suited for enterprise-grade, multi-purpose applications that demand high adaptability, live data integration, and sophisticated orchestration of AI components.

实际案例比较

Real-world example comparison

让我们根据一份包含文字和图片的产品手册来回答一个问题。

Let us take the task of answering a question based on a product manual that includes both text and figures.

  • VLM 可以为图像生成描述,或者回答一些基本的视觉问题,例如:开关是什么颜色的?
  • A VLM might be able to caption the image or answer a basic visual question, such as: What color is the switch?
  • 然而,多模态GenAI系统可以:
    • 解析PDF文件
    • 提取图表
    • 对嵌入式图像使用OCR技术
    • 获取相关产品规格
    • 使用语言模型生成完整的响应,例如:要激活省电模式,请按下图2.3 中所示的绿色开关。该开关控制辅助电路。
  • A multimodal GenAI system, however, could:
    • Parse the PDF
    • Extract diagrams
    • Use OCR on embedded images
    • Retrieve relevant product specs
    • Use a language model to generate a full response such as: To activate the power-saving mode, press the green switch shown in Figure 2.3. The switch controls the secondary circuit.

这种涵盖文本、图像、布局和逻辑的端到端能力是多模态 GenAI 的标志。

This end-to-end capability across text, image, layout, and logic is the hallmark of multimodal GenAI.

基于输出的多模态系统分类

Output-based classification of multimodal systems

多模态GenAI系统不仅能够处理多种类型的输入(例如文本、图像、音频或结构化数据),还能够生成多种多样的输出。随着各组织在电子商务、医疗保健、软件开发和知识管理等领域部署人工智能系统,根据输出性质对多模态系统进行分类变得日益重要。这种分类有助于更好地进行架构设计、模型选择,并与下游用例保持一致。

Multimodal GenAI systems are distinguished not only by their ability to process diverse types of input, such as text, images, audio, or structured data, but also by the variety of outputs they are capable of producing. As organizations deploy AI systems across sectors like e-commerce, healthcare, software development, and knowledge management, it becomes important to classify multimodal systems based on the nature of their output. This classification allows for better architectural design, model selection, and alignment with downstream use cases.

本节介绍了一种基于多模态系统生成的输出类型对其进行分类的框架,重点关注六个核心类别:

This section introduces a framework for classifying multimodal systems based on the type of output they generate, focusing on six core categories:

  • 文本转图像
  • Text-to-image
  • 图像转文本
  • Image-to-text
  • 文本和图像到图像
  • Text and image-to-image
  • 文本到规格和图像
  • Text-to-specifications and image
  • 文本转 SQL
  • Text-to-SQL
  • 文本转代码
  • Text-to-code

这些类别中的每一个都反映了一条独特的生成路径,各自具有不同的模型、挑战和应用。

Each of these categories reflects a unique generation pathway with its own models, challenges, and applications.

文本转图像系统

Text-to-image systems

文本到图像的生成是多模态人工智能领域的一项突破性技术,它使系统能够将自然语言提示转化为生动且上下文准确的图像。这一过程的核心是强大的生成模型,例如 DALL·E 2、Stable Diffusion、Imagen 和 Parti,它们能够学习文本语义和视觉特征之间复杂的映射关系。这些系统通常将基于 Transformer 的文本编码器与扩散或自回归解码器相结合,有时还会借助超分辨率模块进行增强。其应用范围涵盖创意设计、广告、娱乐和个性化媒体等领域。尽管这项技术前景广阔,但在提示与图像的对齐、纹理保真度以及消除偏差等方面仍然存在挑战,这也凸显了当前研究人员致力于提升生成图像的真实性、可控性和公平性的努力。

Text-to-image generation is a breakthrough capability in multimodal AI, enabling systems to transform natural language prompts into vivid, contextually accurate images. At the heart of this process are powerful generative models like DALL·E 2, Stable Diffusion, Imagen, and Parti, which learn complex mappings between textual semantics and visual features. These systems typically combine transformer-based text encoders with diffusion or autoregressive decoders, sometimes enhanced by super-resolution modules. Applications span creative design, advertising, entertainment, and personalized media. Despite their promise, challenges remain in prompt-image alignment, texture fidelity, and mitigating biases, highlighting ongoing research efforts to improve realism, controllability, and fairness in generated outputs.

文本到图像的生成是指仅基于自然语言描述生成视觉表示(图像)的过程。这些系统使用强大的生成模型将描述性输入转换为详细且具有上下文感知能力的图像:

Text-to-image generation refers to the process of generating a visual representation (image) based solely on a natural language description. These systems translate descriptive input into detailed and context-aware images using powerful generative models:

  • 核心模型:常用的文本转图像模型包括:
    • DALL·E 2(OpenAI)
    • 稳定扩散(稳定性人工智能)
    • Imagen(谷歌研究院)
    • Parti(谷歌)

    这些模型使用扩散技术或基于 Transformer 的架构来学习语义文本输入和视觉输出之间的映射关系。

  • Core models: Popular text-to-image models include:
    • DALL·E 2 (OpenAI)
    • Stable Diffusion (Stability AI)
    • Imagen (Google Research)
    • Parti (Google)

    These models use either diffusion techniques or transformer-based architectures to learn mappings between semantic textual inputs and visual outputs.

  • 架构:
    • 文本编码器(通常是 Transformer)处理提示信息。
    • 解码器(例如,扩散模型)逐像素生成图像,或者通过中间潜在表示生成图像。
    • 可选的图像超分辨率模块可提高图像保真度。
  • Architecture:
    • The text encoder (usually a transformer) processes the prompt.
    • The decoder (e.g., diffusion model) generates an image pixel-by-pixel or through intermediate latent representations.
    • Optional image super-resolution modules enhance fidelity.
  • 应用领域
    • 营销和广告创意
    • 产品设计模型
    • 娱乐和游戏开发
    • 个性化内容生成
  • Applications:
    • Marketing and advertising creatives
    • Product design mockups
    • Entertainment and game development
    • Personalized content generation
  • 挑战
    • 确保提示和图像细节一致
    • 生成精细纹理和空间布局
    • 训练数据中图像输出的偏差
  • Challenges:
    • Ensuring alignment between prompt and image details
    • Generating fine-grained textures and spatial arrangements
    • Bias in image outputs from training data
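The staged architecture described above (text encoder, latent decoder, optional super-resolution) can be illustrated with a toy NumPy sketch. Every component here is a stand-in: the "encoder" returns random features, the "decoder" merely nudges noise toward the conditioning signal over a few steps, and the upscaler is nearest-neighbour upsampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt):
    # Stand-in for a transformer text encoder: one fixed-size embedding.
    return rng.standard_normal(8)

def latent_decoder(text_embedding, steps=4):
    # Stand-in for a diffusion decoder: start from noise and nudge the
    # latent toward the conditioning signal for a few "denoising" steps.
    latent = rng.standard_normal((16, 16))
    for _ in range(steps):
        latent = 0.8 * latent + 0.2 * text_embedding.mean()
    return latent

def super_resolution(latent, factor=4):
    # Stand-in for an upscaler: nearest-neighbour upsampling.
    return np.kron(latent, np.ones((factor, factor)))

image = super_resolution(latent_decoder(text_encoder("a red bicycle")))
print(image.shape)  # (64, 64)
```

Real diffusion models replace each stub with a learned network, but the dataflow, from prompt embedding through iterative latent refinement to a final fidelity-enhancing stage, follows this same composition.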

图像转文本系统

Image-to-text systems

图像到文本生成系统使机器能够使用自然语言解释和描述视觉内容,从而弥合视觉和语言之间的鸿沟。这些系统超越了基本的图像描述功能,能够从图表、场景或示意图等复杂的视觉对象中提供丰富的摘要或结构化的见解。它们采用 BLIP、MiniGPT-4 和 Flamingo 等模型,将视觉编码器与语言解码器相结合,从图像中生成连贯的文本。这些模型基于精心挑选或自监督的数据集进行训练,支持辅助功能、内容管理和视觉问答(VQA)等领域的应用:

Image-to-text generation systems empower machines to interpret and describe visual content using natural language, bridging the gap between vision and language. These systems go beyond basic captioning to deliver rich summaries or structured insights from complex visuals like charts, scenes, or diagrams. Powered by models such as BLIP, MiniGPT-4, and Flamingo, they combine vision encoders with language decoders to generate coherent text from images. Trained on curated or self-supervised datasets, these models support applications in accessibility, content management, and VQA:

  • 核心模型
    • BLIP/BLIP-2(Salesforce)
    • MiniGPT-4
    • VisualGPT
    • Flamingo(DeepMind)
  • Core models:
    • BLIP/BLIP-2 (Salesforce)
    • MiniGPT-4
    • VisualGPT
    • Flamingo (DeepMind)
  • 生成式图像到文本架构
    • 视觉编码器从图像中提取视觉特征。
    • 语言模型解码器将视觉特征转换为连贯、人类可读的文本。

    这些模型可以使用 COCO 等监督数据集或自监督图像描述对进行训练。

  • Generative image-to-text architecture:
    • A vision encoder extracts visual features from the image.
    • A language model decoder translates visual features into coherent, human-readable text.

    These models can be trained using supervised datasets like COCO or self-supervised image caption pairs.

  • 应用领域
    • 自动生成字幕,方便用户使用
    • 数字资产标签和分类
    • VQA
    • 网站和文档中的描述性替代文本生成
  • Applications:
    • Automatic captioning for accessibility
    • Digital asset tagging and classification
    • VQA
    • Descriptive alt-text generation in websites and documents
  • 挑战
    • 理解空间和关系信息。
    • 处理抽象或艺术性内容。
    • 跨图像领域进行概括(例如,医学图像、航空图像、合成图像)。
  • Challenges:
    • Understanding spatial and relational information.
    • Handling abstract or artistic content.
    • Generalizing across image domains (e.g., medical, aerial, synthetic).
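To make the encoder-decoder split concrete, the following toy sketch greedily decodes a caption from a two-dimensional "visual feature" using a hand-made word-scoring table. The vocabulary and weights are invented for illustration and bear no relation to a trained model:

```python
import numpy as np

# Toy vocabulary and a hand-made "alignment" between visual feature
# dimensions and words -- a stand-in for a trained captioning head.
vocab = ["a", "dog", "car", "<eos>"]
word_weights = np.array([
    [0.2, 0.2],  # "a"
    [1.0, 0.0],  # "dog"  (fires on feature dimension 0)
    [0.0, 1.0],  # "car"  (fires on feature dimension 1)
    [0.4, 0.4],  # "<eos>"
])

def decode_caption(image_features, max_len=3):
    # Greedy decoding: pick the highest-scoring word, then suppress it
    # so the toy model does not repeat itself; stop at <eos>.
    weights = word_weights.copy()
    caption = []
    for _ in range(max_len):
        idx = int((weights @ image_features).argmax())
        if vocab[idx] == "<eos>":
            break
        caption.append(vocab[idx])
        weights[idx] = -np.inf
    return " ".join(caption)

dog_features = np.array([0.9, 0.1])  # pretend vision-encoder output
print(decode_caption(dog_features))
```

A real decoder conditions each step on the previously generated tokens as well as the visual features; this sketch keeps only the vision-conditioned scoring to show where the two modalities meet.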

文本和图像系统

Text and image systems

这类多模态系统以文本和图像作为输入,并生成修改或合成的图像作为输出。这些模型通常根据输入提示进行引导式生成或编辑。

This class of multimodal systems takes both text and image as input and produces a modified or synthesized image as output. These models often perform guided generation or editing based on the input prompt.

文本和图像系统代表了一种先进的多模态人工智能,它同时利用视觉和文本输入来指导图像的生成或编辑。与传统的文本到图像模型不同,这些系统会根据现有图像和描述性提示来生成输出,从而实现对修改的精细控制。诸如 InstructPix2Pix、ControlNet 和 Paint by Text 等模型利用双编码器来提取和融合视觉和语言特征,生成具有上下文感知能力的视觉输出。其应用范围涵盖智能照片编辑、视觉个性化以及设计原型制作等。然而,如何在保证提示信息的准确性和图像完整性之间取得平衡仍然是一项挑战——既要确保结构一致性、对象完整性,又要实现逼真的变换,同时避免过度改变源图像。让我们来详细了解一下:

Text and image systems represent an advanced category of multimodal AI where both visual and textual inputs are used to guide image generation or editing. Unlike traditional text-to-image models, these systems condition outputs on an existing image and a descriptive prompt—enabling fine-grained control over modifications. Models like InstructPix2Pix, ControlNet, and Paint by Text leverage dual encoders to extract and merge visual and linguistic features, producing context-aware visual outputs. Applications range from intelligent photo editing and visual personalization to design prototyping. However, challenges persist in balancing prompt fidelity with image integrity—ensuring structural consistency, object preservation, and realistic transformations without over-altering the source image. Let us understand it in detail:

  • 核心模型
    • InstructPix2Pix
    • ControlNet(与稳定扩散一起使用)
    • Paint by Text
    • Text2LIVE

    这些模型通过引入参考图像或条件化机制来扩展基本的文本到图像处理流程。

  • Core models:
    • InstructPix2Pix
    • ControlNet (used with Stable Diffusion)
    • Paint by Text
    • Text2LIVE

    These models extend basic text-to-image pipelines by incorporating reference images or conditioning mechanisms.

  • 架构:
    • 图像编码器从输入图像中提取特征。
    • 文本编码器用于捕获指令或提示。
    • 条件图像生成模型将两种模态融合起来,生成编辑或引导的图像。
  • Architecture:
    • An image encoder extracts features from the input image.
    • A text encoder captures instructions or prompts.
    • A conditioned image generation model merges both modalities to generate an edited or guided image.
  • 应用领域
    • 人工智能辅助照片编辑(例如,把天空变成日落般的橙色)
    • 视觉个性化
    • 用变体扩充训练数据
    • 根据反馈进行设计迭代
  • Applications:
    • AI-assisted photo editing (e.g., make the sky sunset orange)
    • Visual personalization
    • Augmenting training data with variations
    • Design iteration based on feedback
  • 挑战
    • 控制与原始图像的改变程度
    • 保持对象身份和结构
    • 保持视觉连贯性和真实感
  • Challenges:
    • Controlling the degree of change from the original image
    • Preserving object identity and structure
    • Maintaining visual coherence and realism

纯文本到规格和图像系统

Text-only to specifications and image systems

这些系统接收基于文本的提示(通常是描述性的或功能性的),并生成结构化的规范(例如,物料清单、布局图或产品蓝图)和相应的视觉输出。

These systems take a text-based prompt, often descriptive or functional in nature, and generate both a structured specification (e.g., a bill of materials, layout plan, or product blueprint) and a corresponding visual output.

这项任务需要对文本中表达的意图有深刻的理解,并且能够生成与该意图相符的多模态输出。

This task requires a deep understanding of both the intent expressed in text and the ability to generate multimodal outputs aligned with that intent.

让我们来看一些使用案例:

Let us look at some example use cases:

  • 根据文本描述生成用户界面模型和组件规格。
  • Generating UI mockups and component specs from textual descriptions.
  • 创建包含材料清单的建筑蓝图。
  • Creating architectural blueprints with material listings.
  • 产品设计规格及可视化原型。
  • Product design specs with visual prototypes.
  • 机器人或物联网配置工作流程。

    在新兴的多模态系统中,文本到规范的转换可以与图像架构相结合,从而将结构化输出生成的精确性与视觉合成的创造性融合起来。这些系统能够解读用户提示,生成机器可读的规范(例如 JSON、YAML)和相应的图像,从而实现从概念到设计的无缝过渡。语言模型将用户意图解码为结构化数据,而文本到图像生成器则将相同的概念可视化,通常会基于共享的潜在特征来确保一致性。这种架构在人工智能辅助设计、产品定制和数字原型制作等应用中至关重要,但它在保持语义准确性和同步双重输出方面面临着挑战:

  • Robotics or IoT configuration workflows.

    In emerging multimodal systems, text-to-spec generation can be combined with image generation in architectures that pair the precision of structured output with the creativity of visual synthesis. These systems interpret user prompts to produce both machine-readable specifications (e.g., JSON, YAML) and corresponding images, enabling seamless transitions from concept to design. A language model decodes intent into structured data, while a text-to-image generator visualizes the same concept, often conditioned on shared latent features to ensure alignment. This architecture is key in applications like AI-assisted design, product customization, and digital prototyping, though it faces challenges in maintaining semantic accuracy and synchronizing dual outputs:

  • 架构:
    • 语言模型解释用户的意图并生成结构化输出(JSON、YAML 等格式)。
    • 文本转图像生成器根据相同或改进后的提示将结果可视化。
    • 在高级设置中,共享的潜在表示对两个输出都起作用,以确保一致性。
  • Architecture:
    • A language model interprets the user’s intent and generates structured output (in JSON, YAML, etc.).
    • A text-to-image generator visualizes the outcome based on the same or refined prompt.
    • In advanced setups, a shared latent representation conditions both outputs to ensure consistency.
  • 应用领域
    • 人工智能辅助设计(Figma插件、CAD工具)
    • 电子商务产品定制
    • 基于指令的数字孪生或原型
  • Applications:
    • AI-assisted design (Figma plugins, CAD tools)
    • E-commerce product customization
    • Instruction-based digital twins or prototypes
  • 挑战
    • 同步结构化和可视化输出
    • 确保规范的语义正确性
    • 处理自然语言提示中的歧义
  • Challenges:
    • Synchronizing structured and visual outputs
    • Ensuring the semantic correctness of specifications
    • Handling ambiguities in natural language prompts
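The dual-output pattern above can be sketched as follows. The rule-based prompt parser stands in for a language model and the array renderer stands in for an image generator; conditioning both on the same spec is what keeps the two outputs aligned by construction. All names and rules here are illustrative:

```python
import json
import re
import numpy as np

COLORS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

def prompt_to_spec(prompt):
    # Hypothetical rule-based "language model": pull a colour and a size
    # out of the prompt and emit a machine-readable spec.
    color = next((c for c in COLORS if c in prompt.lower()), "red")
    match = re.search(r"(\d+)\s*px", prompt)
    size = int(match.group(1)) if match else 32
    return {"shape": "square", "color": color, "size_px": size}

def spec_to_image(spec):
    # The "image generator" is conditioned on the *same* structured spec,
    # so the visual output cannot drift from the specification.
    img = np.zeros((spec["size_px"], spec["size_px"], 3), dtype=np.uint8)
    img[:, :] = COLORS[spec["color"]]
    return img

spec = prompt_to_spec("a green square icon, 16px")
image = spec_to_image(spec)
print(json.dumps(spec), image.shape)
```

In the advanced setups the section describes, the shared conditioning is a learned latent representation rather than an explicit JSON object, but the design goal, one source of truth feeding both outputs, is the same.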

文本转SQL系统

Text-to-SQL systems

文本到 SQL 系统通过将用户查询转换为可执行的 SQL 语句,将自然语言理解与结构化数据库查询相结合。这些系统无需用户了解 SQL 语法即可实现直观的数据访问。在高级多模态配置中,这些模型可以整合其他输入,例如表格、文档或图像(例如扫描的发票),以及文本,从而生成准确且上下文相关的 SQL 查询。这些系统由 SQLCoder 等模型以及 PICARD + T5 等模式约束变体提供支持,并在 Spider 和 CoSQL 等基准测试中进行评估,从而拓展了数据库交互和企业分析自动化的边界。让我们来看看它们的详细信息:

Text-to-SQL systems bridge natural language understanding with structured database querying by translating user queries into executable SQL statements. These systems enable intuitive data access without requiring users to know SQL syntax. In advanced multimodal configurations, the models can incorporate additional inputs, such as tables, documents, or images (e.g., scanned invoices), alongside text to generate accurate, context-aware SQL queries. Powered by models like SQLCoder and schema-constrained variants such as PICARD + T5, these systems are evaluated on benchmarks like Spider and CoSQL, pushing the boundaries of database interaction and enterprise analytics automation. Let us look at their details:

  • 核心模型
    • SQLCoder
    • 使用少样本提示的文本到 SQL LLM
    • PICARD + T5(模式约束生成)
    • CoSQL 和 Spider 作为常用基准测试工具。
  • Core models:
    • SQLCoder
    • Text-to-SQL LLMs using few-shot prompting
    • PICARD + T5 (schema-constrained generation)
    • CoSQL, Spider as common benchmarks
  • 架构:
    • 语言模型会解释输入提示。
    • 它使用以下两种方法之一:
      • 模式感知解码(自动完成表名和列名)
      • 从向量数据库中检索模式
      • 代理规划(针对多轮查询)

    在一些高级系统中,文档嵌入和多模态信号被用来动态地指导 SQL 生成。

  • Architecture:
    • A language model interprets the input prompt.
    • It uses either:
      • Schema-aware decoding (auto-complete table and column names)
      • Retrieval of schema from vector DBs
      • Agentic planning (for multi-turn queries)

    In some advanced systems, document embeddings and multimodal signals are used to dynamically guide SQL generation.

  • 应用领域
    • 对话式商业智能工具
    • 客户服务仪表盘
    • 企业数据库的自然语言界面
  • Applications:
    • Conversational BI tools
    • Customer service dashboards
    • Natural language interfaces for enterprise databases
  • 挑战
    • 用户查询中的歧义
    • 模式对齐和动态数据库
    • 安全性和查询优化
  • Challenges:
    • Ambiguity in user queries
    • Schema alignment and dynamic databases
    • Security and query optimization
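A toy sketch of the schema-aware idea: candidate tables and columns are restricted to a fixed schema, which is the core principle behind schema-constrained decoding (e.g., PICARD). The matching heuristics here are deliberately naive and purely illustrative:

```python
# Hypothetical schema; only names that exist here may appear in the SQL.
SCHEMA = {
    "orders": ["id", "customer", "total", "created_at"],
    "customers": ["id", "name", "country"],
}

def text_to_sql(question):
    q = question.lower()
    # Pick the first table whose (singularized) name appears in the question.
    table = next((t for t in SCHEMA if t in q or t.rstrip("s") in q), None)
    if table is None:
        raise ValueError("no table in the schema matches the question")
    # Keep only schema columns actually mentioned; default to '*'.
    cols = [c for c in SCHEMA[table] if c in q] or ["*"]
    return f"SELECT {', '.join(cols)} FROM {table};"

print(text_to_sql("show the total for each order"))
```

Real systems replace the string matching with an LLM whose decoding is constrained or validated against the schema, but the guarantee being sought is the same: the generated query can only reference tables and columns that actually exist.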

文本转代码系统

Text-to-code systems

文本转代码系统能够将自然语言指令自动翻译成可执行代码,从而简化软件开发流程并加速自动化。这些系统利用 Codex、Code Llama 和 StarCoder 等强大的面向代码的语言模型,可以生成从简单函数到功能齐全的应用程序的各种代码,并支持多种编程语言。作为生成式人工智能(GenAI)领域增长最快的技术之一,文本转代码技术正在重塑开发人员进行软件原型设计、调试和构建的方式。这些模型可应用于集成开发环境(IDE)、低代码平台和开发者辅助等领域,从而降低技术门槛,并提高各种编码任务的效率。请参考以下列表,深入了解文本转代码系统:

Text-to-code systems enable the automatic translation of natural language instructions into executable code, streamlining software development and accelerating automation. Leveraging powerful code-focused language models like Codex, Code Llama, and StarCoder, these systems can generate anything from simple functions to full-fledged applications across programming languages. As one of the fastest-growing areas in GenAI, text-to-code technology is reshaping how developers prototype, debug, and build software. With applications in IDE integration, low-code platforms, and developer assistance, these models reduce technical barriers and boost productivity across a wide range of coding tasks. Refer to the following list to build an understanding of text-to-code systems:

  • 核心模型
    • Codex(OpenAI)
    • Code Llama(Meta)
    • StarCoder
    • PolyCoder,代码生成器

    这些模型在大规模编程语料库(例如 GitHub)上进行训练,并针对指令遵循进行微调。

  • Core models:
    • Codex (OpenAI)
    • Code Llama (Meta)
    • StarCoder
    • PolyCoder, CodeGen

    These models are trained on large-scale programming corpora (e.g., GitHub) and fine-tuned for instruction-following.

  • 架构:
    • 语言模型经过微调,能够理解开发者的提示,并生成语法有效且功能相关的代码。
    • 可以通过以下方式增强模型的上下文信息:
      • API
      • 代码库
      • 文档

    在某些多模态设置中,可以将可视化图表(例如流程图或统一建模语言(UML)图)与提示相结合,以生成与可视化逻辑一致的代码。

  • Architecture:
    • A language model is fine-tuned to understand developer prompts and produce syntactically valid and functionally relevant code.
    • Models can be enhanced with context from:
      • APIs
      • Codebases
      • Documentation

    In some multimodal setups, visual diagrams (e.g., flowcharts or Unified Modeling Language (UML)) can be paired with prompts to generate code that aligns with visual logic.

  • 应用领域
    • 根据功能需求自动生成 API 或脚本
    • 根据文档创建测试用例
    • 智能代码补全和错误修复
    • 低代码/无代码开发接口
  • Applications:
    • Auto-generating APIs or scripts from functional requirements
    • Creating test cases from documentation
    • Intelligent code completion and bug fixing
    • Low-code/no-code development interfaces
  • 挑战
    • 确保代码的正确性和安全性
    • 将用户意图与代码功能保持一致
    • 管理版本控制并将其集成到实际系统中
  • Challenges:
    • Ensuring code correctness and safety
    • Aligning user intent with code functionality
    • Managing versioning and integration into real systems
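The context-enrichment step described above can be sketched as simple prompt assembly. The model call itself is omitted, and the section headers and the example API line are illustrative, not a real interface:

```python
# Sketch of context enrichment for text-to-code: the user instruction is
# combined with retrieved API docs and repository snippets before being
# sent to a code LLM (the model call itself is stubbed out here).
def build_code_prompt(instruction, api_docs, code_snippets):
    parts = ["You are a coding assistant. Use only the APIs below.", ""]
    parts += ["### API documentation", *api_docs, ""]
    parts += ["### Relevant repository code", *code_snippets, ""]
    parts += ["### Task", instruction]
    return "\n".join(parts)

prompt = build_code_prompt(
    instruction="Add a function that retries a request three times.",
    api_docs=["http_get(url) -> Response  # may raise TimeoutError"],
    code_snippets=["def fetch(url):\n    return http_get(url)"],
)
print(prompt.splitlines()[0])
```

Grounding the model in real APIs and existing code this way is what lets generated code call project-specific functions correctly instead of inventing plausible-looking but nonexistent ones.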

根据输出类型对多模态系统进行分类,有助于清晰了解系统功能、架构要求和部署准备情况。虽然这六类分类并非详尽无遗,但它们代表了当今最常见的生产级应用场景。

Classifying multimodal systems based on output type provides clarity on system capabilities, architectural requirements, and deployment readiness. While these six classes are not exhaustive, they represent the most common production-grade use cases emerging today.

多模态人工智能系统可以根据其生成的输出类型和所需的输入进行分类。这种分类有助于理解不同的文本和图像输入组合如何产生不同的输出,例如图像、代码、SQL 或结构化规范。下表概述了主要输出类型、它们对应的输入模态以及代表性用例类别,涵盖从创意设计和个性化到数据分析、自动化和辅助功能等各个方面:

Multimodal AI systems can be categorized based on the type of output they generate and the inputs they require. This classification helps in understanding how different combinations of text and image inputs lead to varied outputs such as images, code, SQL, or structured specifications. The following table outlines key output types, their corresponding input modalities, and representative use case categories, ranging from creative design and personalization to data analytics, automation, and accessibility:

输出类型

Output type

所需输入

Inputs required

生成的输出

Output generated

用例类别

Use case category

文本转图像

Text-to-image

文本

Text

图像

Image

设计、营销、创意人工智能

Design, marketing, creative AI

图像转文本

Image-to-text

图像

Image

文本

Text

可访问性、搜索、索引

Accessibility, search, indexing

文本+图像到图像

Text + image-to-image

文字+图片

Text + image

图像

Image

引导式编辑,个性化

Guided editing, personalization

文本转规格 + 图片

Text to specs + image

文本

Text

结构化输出 + 图像

Structured output + image

设计自动化、工程

Design automation, engineering

文本转 SQL

Text-to-SQL

文本

Text

SQL 查询

SQL query

分析、商业智能、数据搜索

Analytics, BI, data search

文本转代码

Text-to-code

文本

Text

代码片段

Code snippet

开发、自动化

Development, automation

表 2.2:按输出类型和用途划分的多模式系统

Table 2.2: Multimodal systems by output type and use

随着多模态系统的不断发展,我们可以预见跨越多种输出类别的混合模型将会出现。例如,读取文档(包含图像的PDF)、检索数据库上下文并生成代码片段或SQL查询的系统不再是设想,它已经在企业级人工智能技术栈中得到开发。

As multimodal systems continue to evolve, we can expect hybrid models that span multiple output classes. For instance, a system that reads a document (PDF with images), retrieves database context, and produces a code snippet or SQL query is no longer hypothetical, it is already under development in enterprise AI stacks.

因此,这些系统的设计者必须将输出类型作为主要设计轴,使其与领域需求、用户体验目标和基础设施能力保持一致。

Designers of these systems must therefore consider output type as a primary design axis, aligning it with domain needs, user experience goals, and infrastructure capabilities.

结论

Conclusion

本章探讨了构建高效多模态GenAI系统的核心架构、分类和设计选择。我们区分了视觉语言模型(VLM)与更广泛的多模态GenAI流水线,考察了它们的输出类型(从文本到图像、图像到文本,到文本到SQL和代码),并分析了使用Qdrant等向量数据库的实现策略。从单一集合到检索编排,每种设计都会影响系统的可扩展性、性能和准确性。通过基于输出类型对系统进行分类,并将其与用例需求相匹配,我们可以清晰地了解何时应该采用专用模型,何时应该采用模块化、检索增强型架构。这种理解为设计可扩展、准确且高效的多模态人工智能应用奠定了基础。

In this chapter, we explored the architecture, classifications, and design choices central to building effective multimodal GenAI systems. We differentiated VLMs from broader Multimodal GenAI pipelines, examined their outputs, from text-to-image and image-to-text to text-to-SQL and code, and analyzed implementation strategies using vector databases like Qdrant. Each design, from single collections to retrieval orchestration, impacts scalability, performance, and accuracy. By classifying systems based on output type and aligning them with use case requirements, we gain clarity on when to adopt specialized models versus modular, retrieval-augmented architectures. This understanding forms the foundation for designing scalable, accurate, and efficient multimodal AI applications.

下一章将介绍如何使用本地LLM设计和实现完全离线的GenAI系统。本章将重点关注隐私优先和成本效益高的部署方式,指导您使用Ollama、ChromaDB、FAISS和LangChain等工具构建RAG管道,所有工具均在本地运行,无需依赖云API。

In the next chapter, you will learn how to design and implement a fully offline GenAI system using local LLMs. Focusing on privacy-first and cost-efficient deployments, the chapter guides you through building a RAG pipeline using tools like Ollama, ChromaDB, FAISS, and LangChain, all running locally without reliance on cloud APIs.

你将使用 Python 嵌入文档、构建检索器并集成 LLM 进行质量保证。最终,你将开发出一个安全、可定制的基于文档的质量保证机器人,该机器人能够完全离线运行,并拥有对数据和计算资源的完全控制权。

You will embed documents, build a retriever, and integrate an LLM for QA using Python. By the end, you will have developed a secure, customizable document-based QA bot capable of operating entirely offline with complete control over data and compute resources.

加入我们的 Discord 空间

Join our Discord space

加入我们的 Discord 工作区,获取最新资讯、优惠信息、全球科技动态、新版本发布以及与作者的交流:

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

https://discord.bpbonline.com


第3章 实现单模态本地GenAI系统

CHAPTER 3 Implementing Unimodal Local GenAI System

介绍

Introduction

本章我们将着手构建一个基于本地大型语言模型(LLM)的检索增强生成(RAG)系统,该系统完全离线运行,无需依赖任何基于云的应用程序编程接口(API)。对于重视隐私、数据主权或预算受限的组织而言,这种方法至关重要。您将学习如何搭建一个安全且私密的生成式人工智能(GenAI)管道,适用于企业级或边缘部署。

In this chapter, we embark on building a retrieval-augmented generation (RAG) system using local large language models (LLMs), completely offline and free from any dependency on cloud-based application programming interfaces (APIs). This approach is essential for organizations prioritizing privacy, data sovereignty, or operating under strict budget constraints. You will learn how to setup a secure and private generative AI (GenAI) pipeline suitable for enterprise or edge deployments.

我们将使用 Ollama 在本地运行功能强大的开源 LLM,确保所有数据都保留在您的计算机上。对于文档嵌入的存储和查询,您可以选择Facebook AI Similarity Search ( Faiss ) 或 Chroma,两者都针对快速高效的相似性搜索进行了优化。检索过程将由 LangChain 管理,这是一个强大的编排框架,集成了 LLM、向量存储和自定义逻辑。LangChain 将处理从将用户查询转换为向量表示到获取相关文档以及向 LLM 提供上下文输入的所有操作。

We will use Ollama to run powerful open-source LLMs locally, ensuring all data remains on your machine. For storing and querying document embeddings, you will choose between Facebook AI Similarity Search (Faiss) and Chroma, both optimized for fast, efficient similarity search. The retrieval process will be managed by LangChain, a robust orchestration framework that integrates LLMs, vector stores, and custom logic. LangChain will handle everything from converting user queries into vector representations to fetching relevant documents and prompting the LLM with contextual input.
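To make the retrieval step concrete before wiring up the real stack, the following sketch re-implements in plain NumPy the kind of exact inner-product search a flat Faiss index performs. Real code would use faiss or Chroma instead; this only shows what "similarity search" computes:

```python
import numpy as np

class FlatIndex:
    """Minimal re-implementation of exact inner-product search, the
    brute-force strategy behind a flat vector index."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, vecs):
        # Append new document embeddings to the index.
        self.vectors = np.vstack([self.vectors, np.asarray(vecs, np.float32)])

    def search(self, query, k):
        # Score every stored vector against the query and return the
        # top-k scores and their positions, best first.
        scores = self.vectors @ np.asarray(query, np.float32)
        top = np.argsort(-scores)[:k]
        return scores[top], top

index = FlatIndex(dim=3)
index.add([[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0]])
scores, ids = index.search([1, 0, 0], k=2)
print(ids.tolist())  # [0, 2]
```

Faiss accelerates exactly this computation (and offers approximate variants for large corpora), while LangChain hides it behind a retriever interface; understanding the underlying arithmetic makes the later debugging of retrieval mismatches much easier.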

除了动手开发之外,我们还将研究 RAG 系统的故障点,例如文档分块不佳、嵌入质量问题和检索不匹配等,并探讨缓解这些问题的策略。在本章结束时,您将拥有一个功能齐全的私有单模态 RAG 流水线,并对其设计、权衡和局限性有更深入的了解。

In addition to hands-on development, we will also examine the failure points of RAG systems, such as poor document chunking, embedding quality issues, and retrieval mismatches, and explore strategies to mitigate them. By the end of this chapter, you will have a fully functional, private unimodal RAG pipeline and a deeper understanding of its design, trade-offs, and limitations.

结构

Structure

本章我们将学习以下主题:

In this chapter, we will learn about the following topics:

  • GPU 在当今生成式人工智能系统中的应用
  • GPU in today’s generative AI systems
  • 使用本地 GPU
  • Using a local GPU
  • 关于 Ollama
  • About Ollama
  • 使用 Ollama 生成 PDF 文档
  • Generate a PDF document with Ollama
  • RAG 实现
  • RAG implementation
  • RAG面临的挑战
  • Challenges in RAG

目标

Objectives

本章旨在指导您使用本地 LLM 构建一个完全离线、单模态的 RAG 系统。您将学习如何使用 Ollama 运行 LLM,使用 Faiss 或 ChromaDB 存储和搜索文档嵌入,以及使用 LangChain 管理检索和生成工作流程。重点在于创建一个安全、私密且经济高效的 GenAI 流水线,适用于企业或边缘环境。此外,您还将深入了解 RAG 系统中常见的故障点以及如何解决这些故障点,以确保生成更准确、更可靠的 AI 响应。

The objective of this chapter is to guide you through building a fully offline, unimodal RAG system using local LLMs. You will learn to run LLMs with Ollama, store and search document embeddings using Faiss or ChromaDB, and manage the retrieval and generation workflow using LangChain. The focus is on creating a secure, private, and cost-effective GenAI pipeline suitable for enterprise or edge environments. Additionally, you will gain insights into common failure points in RAG systems and how to address them to ensure more accurate and reliable AI-generated responses.

GPU 在当今生成式人工智能系统中的应用

GPU in today’s generative AI systems

在了解开发 RAG 系统的方法之前,有必要先了解图形处理单元(GPU)在当今 GenAI 应用中的作用。

Before we explore how to develop a RAG system, it is important to understand the role graphics processing units (GPUs) play in today’s GenAI applications.

GPU 在加速 LLM 的性能以及在 RAG 系统中嵌入模型方面发挥着至关重要的作用。然而,是否需要 GPU 取决于多种因素,包括模型大小、工作负载需求、延迟要求和系统架构。了解何时需要 GPU 以及何时可以使用 GPU,有助于构建高效且经济的 GenAI 系统,尤其是在离线或资源受限的环境中。

GPUs play a critical role in accelerating the performance of LLMs and embedding models within a RAG system. However, whether or not you need a GPU depends on several factors, including model size, workload demands, latency requirements, and system architecture. Understanding when a GPU is necessary and when it is optional helps in building efficient and cost-effective GenAI systems, especially in offline or resource-constrained environments.

Let us look at situations in which a GPU is required, as well as cases where a CPU may suffice:

  • Running large LLMs efficiently: If your RAG system uses large models like Llama, Mistral, or Mixtral with billions of parameters, GPUs significantly accelerate inference time. These models require substantial memory bandwidth and parallel computation, which GPUs are designed to handle. On central processing units (CPUs), these models may run extremely slowly or not at all due to memory limitations.
  • Low-latency requirements: For real-time applications, such as chatbots, customer support assistants, or interactive document search, low latency is essential. A GPU can reduce response times from seconds to milliseconds, greatly improving user experience.
  • Batch processing and high throughput: In enterprise applications, where many queries are processed simultaneously, GPUs help achieve high throughput. They enable parallel computation for multiple users or documents, making the system scalable.
  • Training or fine-tuning models: Although most RAG systems focus on inference, some advanced setups require fine-tuning of LLMs or embedding models. This task is practically impossible on CPUs due to massive compute requirements and memory load; GPUs are essential for this phase.
  • Embedding large document sets: Converting documents into embeddings using models like Sentence Transformers or Bidirectional Encoder Representations from Transformers (BERT) can be very slow on CPUs. If you are processing thousands of documents, a GPU speeds up the embedding step dramatically.
  • Using small or quantized models: If your RAG system employs smaller LLMs (e.g., TinyLlama, DistilGPT2) or quantized versions of larger models (like 4-bit or 8-bit quantized Llama), you can run inference reasonably well on modern CPUs. Ollama, for instance, is optimized for such use cases and can run quantized models efficiently without a GPU.
  • Non-real-time applications: For batch tasks, such as nightly document indexing, report generation, or internal knowledge base querying, where latency is not a concern, CPU-based execution is acceptable. These processes can run slowly without affecting the user experience.
  • Edge or offline deployments with hardware constraints: In edge devices or secure environments where GPU availability is limited or power consumption is a concern, using CPUs is often the only option. Optimizations like model quantization and efficient retrieval strategies (e.g., pre-filtering embeddings) can help compensate for the lack of GPU acceleration.
  • Proof-of-concept or low-volume use: For prototyping, academic exploration, or small-scale systems with infrequent usage, CPU execution may suffice. This lowers cost and complexity, making the system easier to deploy and maintain.

While GPUs are essential for accelerating GenAI workloads, the choice between cloud and local deployment impacts both cost and control. Using a local GPU can offer a more economical and efficient solution in many real-world scenarios.

Using a local GPU

Using a local GPU setup can be significantly more cost-friendly than relying on cloud-based GPU services, particularly in scenarios where workloads are predictable, continuous, or privacy-sensitive. Cloud GPU providers, like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, typically charge by the hour or minute, and costs can escalate quickly, especially for high-end GPUs like the A100 or H100. For teams working on long-running GenAI tasks, such as document processing, real-time RAG systems, or LLM fine-tuning, these charges accumulate rapidly. In contrast, investing in a local GPU workstation, though it may involve a higher upfront cost, can result in substantial savings over time. Once the hardware is paid for, the cost of running models locally is limited mostly to electricity and maintenance.

Moreover, local GPUs offer better control over resource utilization and scheduling, making them more efficient for continuous or iterative development. In cloud environments, users often have to wait for instance availability, deal with session timeouts, or manage additional storage and network fees. Local infrastructure eliminates these inefficiencies, allowing developers to run processes on demand without incurring extra costs. This is particularly beneficial in environments where experiments are run frequently, models are retrained, or batch inference is performed on large datasets. For example, embedding hundreds of documents or running local inference with quantized LLMs can be done without the hourly costs that cloud platforms impose.

Local GPUs support better cost control in privacy-focused or offline deployments. Many enterprises or government institutions have strict data governance policies that prohibit uploading sensitive data to external cloud services. Running GenAI systems with local LLMs on in-house GPUs not only ensures data remains on-premises but also avoids the ongoing cost of compliance-heavy cloud setups. In such cases, cloud GPU usage might require additional security layers, virtual private clouds (VPCs), and dedicated instances, further increasing costs. A local GPU setup, once in place, provides both a secure and economically sustainable platform for deploying advanced AI systems. For organizations with consistent needs and long-term GenAI goals, local GPUs represent a smart investment with a high return over time.
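The local-versus-cloud trade-off ultimately reduces to simple break-even arithmetic. The sketch below uses illustrative, assumed prices (a roughly $2,500 workstation versus roughly $3/hour for a cloud GPU instance); actual figures vary by vendor, region, and GPU class.

```python
def cloud_cost(hours: float, hourly_rate: float) -> float:
    """Total cloud GPU spend for a given number of usage hours."""
    return hours * hourly_rate

def break_even_hours(hardware_cost: float, hourly_rate: float) -> float:
    """Hours of cloud usage at which a local GPU purchase pays for itself
    (ignoring electricity and maintenance, which shift the figure slightly)."""
    return hardware_cost / hourly_rate

# Assumed, illustrative prices: a $2,500 local workstation vs. $3/hour cloud GPU.
hours = break_even_hours(2500, 3.0)
print(f"Local GPU pays for itself after ~{hours:.0f} GPU-hours")
```

At these assumed rates, a team running inference for even a few hours a day crosses the break-even point well within a year, which is the intuition behind the paragraphs above.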

Running an LLM locally requires balancing hardware capacity, software tools, and model optimization techniques to achieve fast, reliable inference without relying on cloud services. The following steps outline how selecting the right model size, runtime environment, and deployment strategy can bring advanced AI capabilities entirely on-device, ensuring privacy, control, and offline availability:

1. Hardware requirements: The main factors are model size and whether you use a CPU or GPU.

The following table compares hardware requirements for running LLMs locally at different model sizes, showing the CPU RAM and GPU VRAM needed for smooth inference. It helps you choose the right model size based on your available computing resources.

| Model size | CPU RAM (quantized) | GPU VRAM (full precision) | Example models             |
|------------|---------------------|---------------------------|----------------------------|
| 3–7B       | 8–16 GB             | 6–8 GB                    | Mistral 7B, Llama 2 7B     |
| 13B        | 16–24 GB            | 12–16 GB                  | Llama 2 13B                |
| 30B+       | 32–64 GB            | 24+ GB                    | Llama 2 33B, Mixtral 8x7B  |

Table 3.1: Hardware requirements for running LLMs locally at different model sizes

a. CPU-only is possible with 4-8-bit quantization (slower but cheaper).

b. GPU drastically improves speed (NVIDIA RTX 3060/4060 and above for 7B models).

2. Software:

Model runtime: To load and run the LLM locally.

llama.cpp: A lightweight C++ runner.

Ollama: A simple local model manager.

vLLM: A high-performance GPU inference engine.

Python or API environment:

o Transformers (Hugging Face): To load models locally.

o Accelerate: To optimize multi-GPU or mixed precision.

o bitsandbytes: Quantization support for low RAM use.

3. Model files:

a. Downloaded from Hugging Face or similar.

b. Usually .bin or .gguf files for llama.cpp, or PyTorch .pth for transformers.

c. Quantized versions (4-bit, 8-bit) make local inference practical.

4. Deployment patterns:

a. CLI-based: run in the terminal for quick tests.

b. Local API server: expose endpoints for other apps (e.g., FastAPI, Flask).

c. Integrated in apps: call the model directly from Python or Node.js scripts.

5. Performance tips:

a. Use quantization to shrink model size and reduce memory needs.

b. Prefer efficient model families such as Mistral, Llama, or Phi.

c. Reduce context length if not needed (fewer tokens, faster inference).

d. Store models on SSD for faster load times.
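The effect of quantization in the tips above can be estimated with a rough rule of thumb: a model's memory footprint is approximately parameter count times bytes per weight, plus some overhead for activations and the KV cache. The helper below is a heuristic sketch only (the 20% overhead factor is an assumption), not an exact sizing tool.

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters x bytes per weight, inflated by
    an assumed ~20% for activations and KV cache."""
    weight_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9

# A 7B model: roughly 16.8 GB at fp16 but only ~4.2 GB at 4-bit, which is
# why quantized 7B models fit on consumer GPUs and in modest CPU RAM.
fp16 = model_memory_gb(7, 16)
q4 = model_memory_gb(7, 4)
```

Plugging in the sizes from Table 3.1 reproduces its broad shape: full-precision models need dedicated GPU VRAM, while 4–8-bit variants fit comfortably in ordinary system RAM.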

As we explore more efficient and private GenAI deployments, the shift toward local RAG systems becomes increasingly attractive. Tools like Ollama, Unsloth, and lightweight embedding models make it practical to build powerful RAG pipelines entirely on local hardware. We will implement the architecture shown in Figure 3.1.

Architectural components

The following figure represents the architecture of a RAG system, a popular framework in GenAI that combines retrieval-based methods with LLMs to provide accurate, context-aware answers:

Figure 3.1: A RAG system which we will be implementing

第 1 章“新时代生成式人工智能简介”中所述,上图的工作原理如下:

As explained in Chapter 1, Introducing New Age Generative AI, here is how the preceding figure works:

1. Document processing: Raw documents are first ingested and then split into smaller chunks to improve search granularity and retrieval accuracy.

2. Embedding generation: These chunks are passed through an embedding model (such as OpenAI or a local alternative), which converts them into high-dimensional vector representations.

Note: Chunking is an offline activity; for simplicity, we have shown it in the flow.

3. Vector database: The resulting embeddings are stored in a vector database (e.g., Faiss, Chroma). This database enables fast similarity searches by comparing vectors.

4. User query: When a user submits a query, it is also converted into a vector using the same embedding model.

5. Vector search: The query vector is matched against the stored document vectors to retrieve the most relevant chunks.

6. LLM processing: These retrieved chunks are sent as context to an LLM, which then generates a coherent and informed response.

7. Result delivery: The final output is returned to the user.
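The seven steps above can be condensed into a toy end-to-end sketch. A real system would use an embedding model and a vector database; here a bag-of-words counter stands in for the embedding model and a plain Python list stands in for the vector store, purely to make the retrieval flow concrete.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: chunk the documents and index their vectors.
chunks = [
    "GPUs accelerate LLM inference",
    "Vector databases store embeddings",
    "Ollama runs models locally",
]
index = [(c, embed(c)) for c in chunks]

# Steps 4-5: embed the user query and retrieve the most similar chunk.
query = "how do I run models locally"
best = max(index, key=lambda item: cosine(embed(query), item[1]))

# Steps 6-7: best[0] would be passed as context to the LLM, whose
# response is then returned to the user.
```

Swapping the toy pieces for real ones (an embedding model, Faiss or Chroma, and an LLM) is exactly what the implementation later in this chapter does.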

About Ollama

Ollama is a powerful, yet user-friendly tool designed to simplify the process of running LLMs locally. It provides a clean interface and runtime environment for downloading, managing, and executing models such as Llama, Mistral, and others on your own machine. With just a single command, you can pull pre-configured models and start interacting with them in a secure, offline setting. Ollama handles backend optimizations, including model quantization (e.g., 4-bit and 8-bit), efficient memory usage, and hardware acceleration (GPU/CPU), making it ideal for developers, researchers, and enterprises seeking to build private GenAI applications.

One of Ollama’s key advantages is its focus on privacy and simplicity. Since everything runs locally, no data leaves your machine, making it suitable for sensitive or regulated environments. It also supports integration with frameworks like LangChain, making it an excellent choice for RAG pipelines and other GenAI workflows.

Alternatives to Ollama

Several other tools and frameworks provide similar local LLM capabilities, as detailed below:

  • LM Studio: A desktop app that allows you to run and chat with LLMs locally. It includes a GUI and supports model imports from Hugging Face.
  • GPT4All: Offers a downloadable ecosystem of models and a simple interface for running them locally. It is optimized for consumer-grade hardware.
  • Text Generation Web UI: A highly customizable browser-based interface for running a variety of LLMs with fine control over parameters and model settings.
  • Unsloth: Focuses on fast fine-tuning of LLMs using consumer GPUs, making it ideal for custom model training.
  • AutoGPTQ + transformers: A Python-based setup that allows developers to load quantized models for fast inference without cloud dependence.

Each of these tools caters to slightly different use cases, but all share the goal of democratizing access to powerful LLMs without relying on cloud APIs.

Let us go through a step-by-step guide to install Ollama on your local machine and run the Ollama server:

1. Check system requirements: Before installation, ensure you have:

  • A 64-bit CPU
  • macOS (M1/M2 or Intel), Linux (Ubuntu/Debian-based), or Windows (via WSL2)
  • 8GB+ RAM (16GB or more recommended)
  • Optionally, a GPU (NVIDIA recommended for acceleration)

2. Install Ollama:

  a. On macOS (with Homebrew):

    brew install ollama

  b. On Linux:

    curl -fsSL https://ollama.com/install.sh | sh

This will install the Ollama CLI and set up the environment.

3. On Windows (via WSL2):

  a. Install WSL2 with Ubuntu.
  b. Inside the WSL terminal, run:

    curl -fsSL https://ollama.com/install.sh | sh

4. Start the Ollama server: Once installed, start the Ollama server:

ollama serve

This runs the Ollama server in the background, ready to load and run models.

5. Run a model (e.g., Llama 3 or Mistral): To download and start chatting with a model:

ollama run llama3

This will do the following:

  a. Pull the model from Ollama’s model registry (first time only)
  b. Start an interactive chat interface in your terminal

    6. Optional step: Use Ollama with LangChain or API. Ollama exposes a local HTTP API by default at:

    http://localhost:11434

    You can now integrate Ollama into applications using REST or libraries like LangChain for local RAG pipelines.

    If you are a Mac user, you will see Ollama in the menu bar:

    Figure 3.2: Ollama installation

    You can also list all LLMs using the command:

    ollama list

    Figure 3.3: The image shows a terminal output listing locally installed model in Ollama

    The two models shown in the preceding figure are:

    • mistral:latest, with ID f974a74358d6 and a size of 4.1 GB.
    • llama3.2:3b-instruct-fp16, with ID 195a8c01d91e and a size of 6.4 GB.

Both models were modified four weeks ago, indicating recent setup or updates. This confirms that the local environment is prepared to run these models using the Ollama CLI or API for offline LLM inference.
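Before wiring Ollama into an application, it helps to see the shape of a request to the /api/generate endpoint mentioned above. The helper below only assembles the URL and JSON body; it does not contact the server, so it works even when Ollama is not running.

```python
import json

def build_generate_request(model: str, prompt: str, stream: bool = False):
    """Assemble the URL and JSON body for Ollama's /api/generate endpoint.
    Nothing is sent over the network here."""
    url = "http://localhost:11434/api/generate"
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream})
    return url, body

url, body = build_generate_request("llama3", "Why is the sky blue?")
```

Posting this body with the requests library, as the full script in the next section does, returns the model's completion in the response's "response" field.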

Generate a PDF document with Ollama

Now that you understand Ollama, let us use it to generate a PDF document, which we will later use in our GenAI system.

Here is an end-to-end Python script that:

  • Uses the Ollama REST API directly.
  • Sends a prompt to the llama3.2:3b-instruct-fp16 model.
  • Generates a document of up to 600 words.
  • Saves the output as a PDF file using reportlab.

The prerequisites are:

  • Ollama installed and running:

    ollama serve

  • Model pulled:

    ollama run llama3.2:3b-instruct-fp16

  • Python packages installed:

    pip install requests reportlab

With the prerequisites in place, we can now write a Python script that ties everything together.

This script will:

  • Send a prompt to the llama3.2:3b-instruct-fp16 model through the Ollama REST API.
  • Generate an informative article (about 600 words).
  • Save the generated text as a formatted PDF file using reportlab.

The following is the complete code:

import requests
import textwrap
from reportlab.lib.pagesizes import LETTER
from reportlab.pdfgen import canvas

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.2:3b-instruct-fp16"

def generate_text(topic, max_words=600):
    prompt = (
        f"Write an informative article about '{topic}' with approximately {max_words} words. "
        f"Structure the article with an introduction, body, and conclusion."
    )
    response = requests.post(OLLAMA_URL, json={
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False
    })
    if response.status_code == 200:
        return response.json()["response"].strip()
    else:
        raise Exception(f"Error: {response.status_code} - {response.text}")

def save_to_pdf(text, filename):
    pdf = canvas.Canvas(filename, pagesize=LETTER)
    width, height = LETTER
    margin = 50
    text_object = pdf.beginText(margin, height - margin)
    text_object.setFont("Times-Roman", 12)
    wrapped_lines = []
    for paragraph in text.split("\n"):
        wrapped_lines.extend(textwrap.wrap(paragraph, width=90))
        wrapped_lines.append("")
    for line in wrapped_lines:
        text_object.textLine(line)
        if text_object.getY() < margin:
            pdf.drawText(text_object)
            pdf.showPage()
            text_object = pdf.beginText(margin, height - margin)
            text_object.setFont("Times-Roman", 12)
    pdf.drawText(text_object)
    pdf.save()

if __name__ == "__main__":
    topic = "The Role of Artificial Intelligence in Modern Education"
    try:
        print(f"Generating article on: {topic}")
        article = generate_text(topic)
        save_to_pdf(article, "ai_education_article.pdf")
        print("PDF generated successfully: ai_education_article.pdf")
    except Exception as e:
        print(str(e))

Output: The script will create a PDF file named ai_education_article.pdf:

Figure 3.4: The figure confirms successful execution of the script

It shows that an article on the topic The Role of Artificial Intelligence in Modern Education was generated using the local Ollama LLM, and the output was saved as a PDF file named ai_education_article.pdf. This indicates that the local model ran as expected and the document was created without any errors. You are now ready to open the PDF and review the generated content.

An updated script that generates multiple topic-based PDFs is shared in this book’s GitHub repository:

A directory of generated articles containing five PDF files on the topics: renewable energy, artificial intelligence, blockchain in finance, mental health awareness, and the impact of climate change on agriculture.

Figure 3.5: This figure shows the synthetic articles generated after running the script

RAG implementation

Now that you have learned how to automatically generate documents, we will take the previously created ai_education_article.pdf and use it to build a RAG system. This system will include the following components:

  • System prompting with reasoning and acting (ReAct): We will use the ReAct prompting technique, which breaks down complex or long-form questions into smaller, manageable sub-queries to improve reasoning and response accuracy.
  • Document chunking: The PDF will be segmented into smaller chunks to allow for more effective retrieval and contextual analysis.
  • Vector embedding with metadata: Each chunk will be embedded into a vector representation and stored in a vector database along with relevant metadata for more accurate search and retrieval.
  • Hybrid search module: The system will use a combination of semantic search (based on vector similarity) and keyword-based search to enhance retrieval performance.
  • LangChain orchestration: LangChain will serve as the orchestration framework, managing the flow between query parsing, retrieval, context building, and LLM prompting.
  • Conversation buffer: A conversation memory buffer will ensure continuity in multi-turn conversations, preserving context across user queries.
  • Citation support: Once an answer is generated, the system will include citations showing exactly which document chunks the answer was derived from.
  • Natural language generation: Final responses will be generated using the Mistral model via Ollama, ensuring fluent and coherent natural language output.

Figure 3.6 shows a structured layout that exemplifies a clean and scalable RAG pipeline. Each folder represents a critical component, ranging from data ingestion (source_docs/), embedding logic (embeddings/), and vector storage (vectorstore/), to retrieval strategies (retriever/), generation logic (llm/), and LangChain-based orchestration (orchestrator/). The modularity allows for easy customization, debugging, and maintenance. Utility scripts for PDF parsing and citation tracking further enhance functionality, while memory management ensures coherent multi-turn interactions. Such a design not only supports offline deployment using local models like Mistral and Ollama but also encourages reusability and extension across varied RAG use cases. The details are as follows:

  • Modular: Each responsibility (embedding, retrieval, LLM, etc.) is isolated for clarity and reusability.
  • Scalable: Easy to extend with more models, data sources, or new retrievers.
  • Debuggable: Smaller files/functions make it easier to test and maintain.

The following figure illustrates a well-organized directory structure for a modular RAG system. It includes components for document ingestion, embedding, hybrid retrieval, LLM interaction (via Ollama), memory handling, source citation, and orchestration through LangChain. Each module is clearly separated to support scalability, reusability, and clarity in development and deployment.

Figure 3.6: A modular folder structure of a RAG system

The following code imports all the tools from LangChain and Python that you will need for:

  • Loading and chunking documents.
  • Embedding and storing vectors.
  • Performing hybrid retrieval.
  • Running a conversational QA chain.
  • Interacting with a local LLM.

    from langchain_community.document_loaders import PyPDFLoader

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    from langchain_community.vectorstores import Chroma

    from langchain.embeddings import OllamaEmbeddings

    from langchain_community.llms import Ollama

    from langchain.prompts import PromptTemplate

    from langchain.chains import ConversationalRetrievalChain

    from langchain.memory import ConversationBufferMemory

    from langchain.retrievers import ContextualCompressionRetriever

    from langchain.retrievers.multi_query import MultiQueryRetriever

    from langchain.retrievers.hybrid import HybridRetriever

    from langchain.retrievers import BM25Retriever

    import os

Load and chunk the PDF document

To begin the RAG pipeline, the PDF document is loaded and segmented into manageable text chunks using LangChain’s PyPDFLoader and RecursiveCharacterTextSplitter. This step is essential for breaking long documents into overlapping text blocks, preserving context while ensuring better retrieval granularity. Chunking with a specified size and overlap allows downstream embedding and retrieval systems to work efficiently without losing the narrative flow. These chunks, as explained in the code and the following list, form the foundation of vectorization and search, enabling fine-grained semantic lookup. Proper chunking ensures the system responds with accurate, relevant, and contextually coherent information during user interactions.

loader = PyPDFLoader("data/source_docs/ai_education_article.pdf")

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(documents)

  • Loads a PDF file and extracts its content.
  • Splits the text into overlapping chunks so it is easier to search and retrieve context later.
  • chunk_size = 500 characters; a 50-character chunk_overlap ensures smoother transitions between chunks.

While RecursiveCharacterTextSplitter is the default and most flexible chunking strategy in LangChain, there are several other chunking methods you can use based on your document type, structure, or use case.

Alternative chunking strategies in LangChain

LangChain supports multiple text splitting strategies beyond the default recursive method. Depending on the structure, language, or domain of your documents, you can choose splitters like CharacterTextSplitter, TokenTextSplitter, SentenceTransformersTextSplitter, NLTKTextSplitter, or SpacyTextSplitter. Each offers unique benefits—some preserve semantic boundaries, others optimize for LLM token limits, and a few handle structured formats like Markdown. Selecting the right splitter is crucial for maintaining content coherence and optimizing embedding quality, especially for applications like question answering, summarization, or retrieval. This modularity enables precise control over document preparation in a RAG pipeline:

  • CharacterTextSplitter:

    from langchain.text_splitter import CharacterTextSplitter

    • Splits text strictly by character count.
    • Does not use hierarchical separators like a recursive splitter.
    • Simple and fast, but might split in the middle of sentences.

    Use case: When you need consistently sized chunks and do not mind rough breaks.

  • TokenTextSplitter:

    from langchain.text_splitter import TokenTextSplitter

    • Splits text based on token count, not characters.
    • Uses the tokenizer of a specific LLM (e.g., OpenAI, Hugging Face) to avoid prompt size overflow.

    Use case: When working with token-limited models like GPT or Mistral.

  • SentenceTransformersTextSplitter:

    from langchain.text_splitter import SentenceTransformersTextSplitter

    • Uses sentence boundaries and semantic similarity to split text.
    • Creates more coherent chunks for better embedding quality.

    Use case: When you want semantically meaningful chunks (especially for QA or summarization).

  • NLTKTextSplitter:

    from langchain.text_splitter import NLTKTextSplitter

    • Uses the Natural Language Toolkit (NLTK) to split text into sentences.

    • Good for well-structured English text.

    Use case: Clean sentence-based chunking without manual logic.

  • SpacyTextSplitter:

    from langchain.text_splitter import SpacyTextSplitter

    • Uses the spaCy natural language processing (NLP) library to split based on linguistic features (sentences, paragraphs).
    • Handles punctuation and sentence boundaries better than raw character-based methods.

    Use case: When you want linguistically accurate splitting in multiple languages.

  • MarkdownHeaderTextSplitter:

    from langchain.text_splitter import MarkdownHeaderTextSplitter

    • Splits Markdown documents using header levels (e.g., #, ##, ###) as structural guides.

    Use case: For documentation, blogs, or README-style content where headers indicate topic changes.

  • You can also combine splitters: Use MarkdownHeaderTextSplitter first, then RecursiveCharacterTextSplitter on each section for precision and structure.
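That two-stage combination can be sketched without LangChain, using a toy header splitter followed by a fixed-size pass (both functions here are illustrative stand-ins for the LangChain classes):

```python
def split_by_headers(markdown_text):
    """Group lines into sections, starting a new section at each Markdown header."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def split_by_size(text, chunk_size=80, chunk_overlap=10):
    """Second pass: fixed-size windows with overlap inside each section."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

doc = "# Intro\nshort intro\n## Details\n" + "d" * 200
chunks = [c for section in split_by_headers(doc) for c in split_by_size(section)]
```

The header pass keeps topically related text together; the size pass then guarantees no chunk exceeds the embedding-friendly length limit.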

Creating embeddings with metadata

After chunking, each text segment is converted into high-dimensional vectors using embedding models like Mistral via OllamaEmbeddings. These embeddings numerically represent semantic meaning, allowing efficient similarity search. Chroma, a local vector database, stores these vectors along with metadata such as document source, enabling traceable retrieval. Persisting this information in the db directory allows reuse without re-embedding. Metadata enhances downstream tasks like filtering by source or time. This step, as shown in the following code and list, transforms unstructured text into structured, queryable memory, making it foundational for intelligent document retrieval in privacy-first, offline deployments:

embedding_model = OllamaEmbeddings(model="mistral")

vectorstore = Chroma.from_documents(chunks, embedding=embedding_model, persist_directory="db")

  • Uses the Mistral model to convert each chunk into a numerical vector embedding.
  • Stores these vectors in a Chroma vector database, along with metadata (like source info).
  • The database is saved locally in the db folder.

    Note: If you want to reuse an existing index, replace the preceding line with:

    if os.path.exists("db/index.sqlite3"):
        vectorstore = Chroma(persist_directory="db", embedding_function=embedding_model)
    else:
        vectorstore = Chroma.from_documents(chunks, embedding=embedding_model, persist_directory="db")

Here are some popular embedding models you can use with OllamaEmbeddings, so you can choose the one that best fits your RAG pipeline needs:

  • mxbai-embed-large (~334M params): Strong performance for semantic search, often compared to OpenAI's ada-002.
  • nomic-embed-text (~25.8M params): High-performance with long context support (2K tokens), recognized for outperforming older commercial models.
  • all-minilm: A compact Sentence Transformers–style model suited for quick embedding tasks.
  • bge-m3 (BAAI General Embedding, ~567M params): Multilingual and versatile embedding model, noted for strong retrieval accuracy and flexibility.
  • snowflake-arctic-embed/snowflake-arctic-embed2: A family of embedding models with different sizes (e.g., 22M, 110M, 335M) optimized for speed and multilingual support.
  • granite-embedding: IBM’s multilingual embedding models (~30M or 278M parameters), suitable for cross-language contexts.

Using them in code

The following is a quick example showing how to switch between different Ollama embedding models:

from langchain.embeddings import OllamaEmbeddings

# Choose one of the models below
model_name = "mxbai-embed-large"
# model_name = "nomic-embed-text"
# model_name = "all-minilm"
# model_name = "bge-m3"

embeddings = OllamaEmbeddings(model=model_name)

vectors = embeddings.embed_documents(["Sample text to embed"])

print(vectors[0][:5])  # preview of first 5 dimensions
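Whichever model you choose, similarity search ultimately comes down to comparing vectors. The sketch below substitutes tiny hand-made 3-dimensional vectors for real embedding output to show the cosine-similarity step a vector store performs:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend these came from embed_documents(); real vectors have hundreds of dims.
store = {
    "doc_about_ai": [0.9, 0.1, 0.0],
    "doc_about_cooking": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend output of embed_query("What is AI?")
best = max(store, key=lambda doc: cosine_similarity(query_vec, store[doc]))
print(best)  # doc_about_ai
```

Chroma performs essentially this comparison (with optimized indexing) when you call `as_retriever()` later in the pipeline.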

Hybrid search with semantic and keyword

Hybrid retrieval combines the strengths of both keyword-based and semantic search. Best Matching 25 (BM25) handles exact keyword matches, useful for proper nouns and rare terms, while vector search retrieves contextually similar content. LangChain’s HybridRetriever fuses both methods, increasing accuracy and recall by addressing both syntactic and semantic relevance. This dual approach ensures robustness across diverse query types, especially in scenarios involving ambiguous or exploratory questions. Configuring k (top results) for both retrievers allows fine-tuning of search behavior, making hybrid retrieval an essential component of modern, high-performance RAG pipelines.

BM25 is a ranking function used in information retrieval to estimate the relevance of documents to a search query. It is part of the probabilistic retrieval model and improves upon earlier models by considering term frequency (how often a word appears in a document), inverse document frequency (how rare a word is across all documents), and document length normalization. BM25 assigns higher scores to documents where query terms appear frequently and are rare in the overall corpus, while adjusting for document length. It is widely used in search engines and modern retrieval systems due to its effectiveness and simplicity:

bm25_retriever = BM25Retriever.from_documents(chunks)

bm25_retriever.k = 4

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

hybrid_retriever = HybridRetriever(vectorstore=vectorstore, bm25_retriever=bm25_retriever)

  • BM25Retriever is for keyword-based search (like traditional search engines).
  • vector_retriever is for semantic similarity (based on embeddings).
  • HybridRetriever combines both to improve accuracy.
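To make the BM25 formula described above concrete, here is a compact, dependency-free implementation of the score over a toy corpus (standard k1 and b defaults; this illustrates the formula, not the internals of BM25Retriever):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the classic BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # rarity bonus
        tf = doc.count(term)                               # term frequency
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)   # length normalization
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    "hybrid retrieval mixes keyword and vector search".split(),
    "bm25 ranks documents by keyword relevance".split(),
    "cats sleep most of the day".split(),
]
scores = [bm25_score(["bm25", "keyword"], doc, corpus) for doc in corpus]
```

The second document wins: it contains both query terms, and "bm25" is rare across the corpus, so its idf contribution is large; the cat document scores zero because neither term appears.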

Aside from BM25Retriever, which is great for traditional keyword-based search, LangChain supports several other retrievers that can be used depending on your RAG system’s needs. Let us discuss a list of useful retrievers and what they are best suited for.

Other retrievers you can use

LangChain offers advanced retrieval strategies beyond basic vector and keyword search. Tools like ContextualCompressionRetriever, MultiQueryRetriever, SelfQueryRetriever, and TimeWeightedVectorStoreRetriever enable summarization, query diversification, and time-aware ranking. Others like ParentDocumentRetriever and EnsembleRetriever optimize for coherence and weighted strategies across retrievers. Each retriever targets a unique problem: lengthy documents, vague queries, metadata filtering, or temporal priority. By combining or swapping retrievers based on use case needs, you can greatly enhance your RAG system’s relevance, flexibility, and performance, particularly in complex, evolving chat or enterprise knowledge environments, details as follows:

  • ContextualCompressionRetriever:

    from langchain.retrievers import ContextualCompressionRetriever

    • Uses an LLM to summarize or compress retrieved content before passing it to the final prompt.
    • Ideal for reducing token size or when documents are very long.
  • MultiQueryRetriever:

    from langchain.retrievers.multi_query import MultiQueryRetriever

    • Generates multiple rephrased versions of the query using an LLM.
    • Retrieves results for each variation, improving coverage and recall.
    • Great for exploratory questions or ambiguous user intent.
  • ParentDocumentRetriever:

    from langchain.retrievers import ParentDocumentRetriever

    • Splits documents into small chunks for vector search but returns the larger parent document to maintain context.
    • Useful for preserving coherence in long documents.
  • SelfQueryRetriever:

    from langchain.retrievers import SelfQueryRetriever

    • Uses an LLM to generate vector search filters based on query content, like metadata-aware retrieval.
    • Example: Find documents written after 2020 about AI.
  • TimeWeightedVectorStoreRetriever:

    from langchain.retrievers import TimeWeightedVectorStoreRetriever

    • Prioritizes recent interactions in conversational memory.
    • Useful in chatbot-like systems where recency matters.
  • EnsembleRetriever:

    from langchain.retrievers import EnsembleRetriever

    • Combines multiple retrievers (e.g., BM25 + vector + multi-query) and lets you assign weights to each.
    • Offers more control over retrieval strategy than HybridRetriever.
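The multi-query pattern (fan out several phrasings of one question, then merge and deduplicate the hits) can be sketched without an LLM by hard-coding the rephrasings; retrieve below is a toy stand-in for any underlying retriever:

```python
def retrieve(query, index, k=2):
    """Toy keyword retriever: rank docs by how many query words they contain."""
    ranked = sorted(index, key=lambda doc: -sum(w in doc for w in query.split()))
    return ranked[:k]

def multi_query_retrieve(rephrasings, index, k=2):
    """Union of per-query results, preserving first-seen order, no duplicates."""
    seen, merged = set(), []
    for q in rephrasings:
        for doc in retrieve(q, index, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

index = [
    "tuition fees for graduate programs",
    "ai curriculum design in universities",
    "campus parking rules",
]
# In MultiQueryRetriever these variants would come from the LLM.
docs = multi_query_retrieve(
    ["ai curriculum", "universities teaching ai", "graduate ai programs"], index
)
```

Each rephrasing surfaces a slightly different slice of the index, and the union step is what improves recall over a single query.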

Conversation memory buffer

To maintain context across multi-turn conversations, LangChain introduces ConversationBufferMemory. This memory module stores the full chat history, enabling the language model to handle follow-ups and reference earlier queries effectively. It ensures that responses are grounded not only in the current question but also in prior interactions, improving coherence and user satisfaction. This is especially valuable in chatbots and assistants, where continuity is essential. With return_messages=True, both user and AI messages are preserved, making the RAG system capable of sustaining rich, ongoing dialogues without losing conversational state.

ConversationBufferMemory is a memory class in LangChain that stores the entire conversation history as a string buffer. It allows a language model (LLM) to remember prior interactions in a chat session, helping the model maintain context across turns, details as follows:

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

  • It keeps track of the conversation history to maintain context in multi-turn dialogues.
  • Important for chatbots to understand follow-up questions.
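Conceptually, the buffer is an append-only list of turns that is replayed into every prompt. The following minimal stand-in (not the LangChain class, whose API differs) captures the idea:

```python
class BufferMemory:
    """Toy analogue of ConversationBufferMemory: keep every turn verbatim."""
    def __init__(self):
        self.messages = []

    def save_context(self, user_input, ai_output):
        self.messages.append(("human", user_input))
        self.messages.append(("ai", ai_output))

    def load_memory(self):
        """Render the whole history the way it would be injected into a prompt."""
        return "\n".join(f"{role}: {text}" for role, text in self.messages)

memory = BufferMemory()
memory.save_context("What is RAG?", "Retrieval-augmented generation.")
memory.save_context("Does it cite sources?", "Yes, via retrieved chunks.")
```

Because the full history is replayed verbatim, the buffer grows with every turn, which is exactly why prompt size overflow (discussed later in this chapter) becomes a concern in long conversations.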

LLM configuration for natural language generation

The RAG system uses the Ollama interface to load the local Mistral language model, controlling output behavior via parameters like temperature. A low temperature value (e.g., 0.2) results in deterministic, focused responses ideal for factual QA or retrieval tasks. By running locally, the setup ensures privacy and cost-efficiency. This language model interprets retrieved content and user questions, generating structured and informative replies. The following modular configuration allows easy switching between models, aligning output style, speed, and accuracy with your application needs—making it a cornerstone of offline, GenAI deployments:

llm = Ollama(model="mistral", temperature=0.2)

  • Loads the Mistral model locally via Ollama.
  • temperature=0.2 means the output will be focused and less random.
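Under the hood, temperature rescales the model's next-token scores before they become probabilities. This toy softmax calculation (with made-up logits) shows why 0.2 makes output focused and near-deterministic:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw token scores to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate next tokens
cold = softmax_with_temperature(logits, 0.2)   # focused, near-deterministic
warm = softmax_with_temperature(logits, 1.0)   # more diverse sampling
```

At temperature 0.2 the top token absorbs over 99% of the probability mass, while at 1.0 the alternatives retain a meaningful share, which is why low temperature suits factual QA and higher temperature suits creative generation.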

ReAct prompt template

ReAct is a prompting strategy that guides LLMs to break down problems into logical steps before answering. The prompt template provides structure: it separates reasoning from the final response, improving transparency and traceability of answers. By explicitly instructing the model to think step-by-step, it aligns LLM output with human-like problem-solving processes. This method boosts performance in knowledge-intensive tasks like retrieval-augmented QA by encouraging the model to synthesize retrieved context meaningfully. ReAct templates enable more explainable, controllable, and trustworthy AI behaviors in RAG systems, details as follows:

react_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an intelligent assistant using the ReAct (Reasoning + Acting) technique.
Break down the user query into reasoning steps and retrieve relevant information accordingly.

Question: {question}

Relevant Context:
{context}

First, list your reasoning steps clearly.
Then, provide a final answer based on those steps and the retrieved context.

Reasoning Steps:
1.
"""
)

  • Instructs the LLM to think step-by-step using the ReAct technique.

    Encourages logical reasoning before generating the final answer.

Building the conversational QA chain

The ConversationalRetrievalChain integrates the core components (LLM, retriever, memory, and prompt) to form a complete RAG workflow. It supports multi-turn dialogue by preserving history, retrieves context with hybrid search, reasons with the ReAct prompt, and responds via the Mistral model. This unified chain not only generates high-quality answers but also returns the source documents used, enhancing transparency and citation. It is the backbone of intelligent assistants and document chat systems, enabling dynamic, context-aware responses. This abstraction simplifies orchestration and encourages modular, scalable design in LLM-powered applications:

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=hybrid_retriever,
    memory=memory,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": react_prompt}
)

It combines the following:

  • The llm (Mistral)
  • The retriever (hybrid search)
  • The conversation memory
  • The prompt for structured reasoning

It returns both the answer and the source documents (for citation).

User chat loop

The final component is the chat loop, where user input is continuously accepted, processed, and responded to by the RAG system. This loop captures user questions, passes them through the conversational QA chain, and displays both the answer and source citations. It supports real-time interaction and multi-turn memory, making it ideal for chatbots, research assistants, or document QA tools. By integrating all prior components, retrieval, generation, memory, and prompting, the chat loop brings the system to life, turning static documents into an interactive knowledge interface for end users:

print("RAG System Ready. Ask a question about the document.")

while True:
    query = input("\nUser: ")
    if query.lower() in ["exit", "quit"]:
        break
    response = qa_chain({"question": query})
    print("\nAssistant:", response["answer"])
    print("\nSources:")
    for doc in response["source_documents"]:
        print("-", doc.metadata.get("source", "[Chunk without source metadata]"))

  • Starts an interactive chat with the user.
  • Sends the user's question to the RAG system.
  • Prints the LLM's natural language answer.
  • Also prints which chunks (sources) were used to generate the answer.

In the preceding section, we touched upon a challenge called prompt size overflow. It occurs when the combined length of a prompt, including the user query, context, system instructions, and memory, exceeds the maximum token limit of the language model. Each model (like Mistral, Llama, or GPT) has a defined token capacity (e.g., 4,096 or 8,000 tokens), and exceeding this limit causes errors or truncated responses. Overflow often happens in RAG systems when too many large chunks or long conversations are included in a single prompt. To prevent it, you can limit chunk size, truncate older memory, or use token-aware text splitters and compression retrievers to keep input within safe bounds.
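The truncate-older-memory idea can be sketched directly. Note that the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and fit_prompt is an illustrative helper:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def fit_prompt(system, context, history, question, max_tokens=4096):
    """Drop the oldest history turns until the combined prompt fits the budget."""
    history = list(history)

    def total():
        parts = [system, context, question] + history
        return sum(estimate_tokens(p) for p in parts)

    while history and total() > max_tokens:
        history.pop(0)  # oldest turn goes first
    return history, total()

history = [f"turn {i}: " + "x" * 400 for i in range(50)]
kept, used = fit_prompt("You are helpful.", "ctx " * 100, history,
                        "What changed?", max_tokens=2048)
```

The system prompt, retrieved context, and current question are always kept; only conversation history is sacrificed, newest turns last, which matches how users experience long chats.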

Just as prompt size overflow can disrupt the performance of a RAG system, there are several other challenges and potential failure points to be aware of. These issues often stem from how documents are chunked, how embeddings are generated, or how retrieval strategies are configured. If not properly addressed, they can lead to irrelevant results, hallucinations, or poor answer quality. Understanding these pain points is crucial for building robust and reliable RAG pipelines. In the next section, we will explore the most common challenges encountered in RAG systems and discuss how to identify and mitigate them in practice.

Challenges in RAG

The paper, Seven Failure Points When Engineering a Retrieval Augmented Generation System, emphasizes that real-world RAG systems need robust runtime validation: you cannot predict failures solely at design time; they emerge through deployment. It offers valuable insight for practitioners building reliable systems, highlighting where checkpoints and corrective mechanisms are most needed.

Here is how you can address each of the seven RAG failure points from the paper in our current RAG pipeline:

  • Missing content:
    • Problem: The LLM fabricates an answer when the information does not exist in the corpus.
    • Solution:
      • Add a fallback in the prompt: If the context is not sufficient to answer, respond with I don’t know based on the given information.
      • Add a confidence threshold: Use cosine similarity scores from the vector store and reject low-confidence results.
  • Missed top-ranked documents:
    • Problem: Relevant documents are present but not in the top-k retrieved.
    • Solution:
      • Use MultiQueryRetriever to generate diverse reformulations of the query.
      • Increase k in:

        vectorstore.as_retriever(search_kwargs={"k": 8})

  • Not in context:
    • Problem: Relevant chunks are retrieved but not included due to token/prompt limits.
    • Solution:
      • Use ContextualCompressionRetriever:

        from langchain.retrievers import ContextualCompressionRetriever

        retriever = ContextualCompressionRetriever(base_compressor=llm, base_retriever=hybrid_retriever)

      • Compress or summarize context to fit the model's token window.
  • Not extracted:
    • Problem: The LLM fails to extract the answer from context.
    • Solution:
      • Improve prompt clarity with explicit instructions and step-by-step reasoning (already done via ReAct).
      • You can also finetune a smaller LLM specifically for extraction tasks if needed.
  • Wrong format:
    • Problem: The LLM ignores formatting instructions (e.g., table, JSON).
    • Solution:
      • Modify the prompt with formatting cues:

        Format your answer as a JSON object or bullet list if possible.

      • You can also use structured output tools like LangChain’s Output Parsers.
  • Incorrect specificity:
    • Problem: Output is too generic or overly detailed.
    • Solution:
      • Let the user define specificity: Respond in a high-level summary vs. give detailed technical explanation.
      • Add prompt templates that accept response_style or detail_level as a variable.
  • Incomplete answers:
    • Problem: The answer includes part of the required info but omits other key facts.
    • Solution:
      • Use answer merging logic to extract content from multiple documents.
    • Add a checklist-style reasoning step:
      1. Identify all relevant facts.
      2. Combine and summarize.
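As a concrete example of the first mitigation (guarding against missing content), a similarity threshold can gate generation; the scores and the 0.3 cutoff below are illustrative values to tune per corpus:

```python
def answer_with_fallback(scored_chunks, threshold=0.3):
    """Refuse to answer when no retrieved chunk clears the similarity bar."""
    relevant = [chunk for chunk, score in scored_chunks if score >= threshold]
    if not relevant:
        return "I don't know based on the given information."
    return "Answering from: " + "; ".join(relevant)

# Scores as a vector store might return them (cosine similarity in [0, 1]).
print(answer_with_fallback([("chunk about AI tutors", 0.82)]))
print(answer_with_fallback([("unrelated chunk", 0.12)]))
```

Combined with the prompt-level fallback instruction, this gives the system two independent chances to say "I don't know" instead of fabricating an answer.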

What you have learned so far is just scratching the surface of foundational RAG systems. You have built a working pipeline that loads documents, chunks them, generates embeddings, stores them in a vector database, and retrieves relevant context for LLM-based answering. You have also implemented ReAct-style prompting and hybrid search. However, RAG systems are complex, with deeper challenges like prompt optimization, failure detection, scalability, and evaluation. This foundational setup prepares you to explore advanced topics such as tool-augmented agents, knowledge graphs, dynamic routing, and custom retrievers, each offering more control, precision, and flexibility in real-world GenAI applications.

You can now take this RAG system as a foundation and begin exploring its flexibility. As a take-home assignment, try modifying the code to experiment with different chunking/splitting strategies, embedding models, LLMs, and retrieval or search methods. Each of these components is modular and easily swappable, allowing you to tailor the system to specific data types, performance needs, or accuracy goals. This hands-on customization will deepen your understanding of how each layer contributes to the overall performance of a GenAI pipeline.

Conclusion

In this chapter, we explored the essential building blocks of modern GenAI systems. We learned the role of GPUs in accelerating AI workloads and how using a local GPU can be a cost-effective, privacy-friendly alternative. We introduced Ollama as a tool to run local LLMs efficiently and walked through the architecture of RAG systems. You also learned to generate PDF documents using a local LLM, and implemented a complete RAG pipeline using LangChain, vector databases, and hybrid retrieval strategies. Finally, we examined key challenges in RAG. In the next chapter, we will implement API-based GenAI systems using OpenAI instead of Ollama.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 4
Implementing Unimodal API-based GenAI Systems

Introduction

In this chapter, we build upon the foundation laid in the previous chapter, where we implemented a fully local retrieval-augmented generation (RAG) system using Ollama and LangChain. While that approach prioritized privacy and offline execution, this chapter shifts focus to cloud-based capabilities by integrating the OpenAI API. This enables us to scale our GenAI applications with access to powerful models like generative pretrained transformer (GPT) and beyond, allowing for enhanced reasoning, broader knowledge coverage, and more complex query handling. Our goal is to extend the RAG system to support multi-document querying.

We will explore how to design and implement a multi-document GenAI system. By combining OpenAI’s API capabilities with thoughtful system design, you will learn how to build more scalable, flexible, and intelligent GenAI pipelines suited for enterprise and cloud-native environments.

Structure

In this chapter, we will learn about the following topics:

  • Getting started with OpenAI APIs and models
  • Core API endpoints
  • Multi-document query
  • Implementing modular RAG system with OpenAI
  • To do

Objectives

The objective of this chapter is to guide you through building a fully API-based, unimodal RAG system using hosted large language models (LLMs). You will learn to run LLMs with OpenAI, store and search document embeddings using Facebook AI Similarity Search (Faiss), and manage the retrieval and generation workflow using LangChain. The focus is on creating a scalable, modular GenAI pipeline suitable for enterprise applications.

Getting started with OpenAI APIs and models

OpenAI is one of the leading artificial intelligence research and deployment companies in the world. It is best known for its state-of-the-art generative models such as GPT, DALL·E (text-to-image generation), Whisper (speech recognition), and Sora (text-to-video generation). These models are designed to be accessed via the OpenAI API, which allows developers to build intelligent applications across a variety of domains, including text generation, image synthesis, audio transcription, and more.

This section provides a comprehensive overview of OpenAI, its models, and the different APIs it offers. Whether you are building a RAG system, a chatbot, a summarization tool, or a multimodal application, understanding OpenAI's offerings is essential for choosing the right tools for your project.

OpenAI as a company

Founded in December 2015, OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. Originally established as a non-profit, OpenAI transitioned into a capped-profit model to attract capital while remaining mission-focused.

The organization is best known for developing powerful language models that are capable of human-like understanding and generation of text. With the release of GPT-2 in 2019, followed by GPT-3 in 2020, and subsequent iterations including GPT-3.5 and GPT-4, OpenAI has consistently set the benchmark in generative AI.

Overview of the OpenAI API

The OpenAI API provides programmatic access to a range of models via RESTful endpoints. This enables developers to integrate powerful AI capabilities into their applications. The API is well-documented and accessible through official software development kits (SDKs) in languages like Python, Node.js, and others.

The main functionalities covered by the OpenAI API include:

  • Text generation
  • Chat-based interactions
  • Code generation and editing
  • Embedding generation
  • Image generation and manipulation
  • Audio transcription and translation
  • Content moderation
  • Fine-tuning and custom models

Core API endpoints

The following table summarizes the key categories of OpenAI API functionality, along with their respective endpoints and use cases, highlighting the broad capabilities available for text, image, audio, and model management operations:

Category | Endpoint(s) | Description
Text generation | /v1/completions, /v1/chat/completions | Generate or continue natural language text
Editing | /v1/edits | Modify existing text or code
Embeddings | /v1/embeddings | Generate vector representations for text
Image generation | /v1/images/generations, /v1/images/edits, /v1/images/variations | Create and edit images using DALL·E
Audio processing | /v1/audio/transcriptions, /v1/audio/translations | Convert speech to text and translate
Moderation | /v1/moderations | Detect harmful or sensitive content
File handling | /v1/files | Upload and manage files for fine-tuning
Fine-tuning | /v1/fine-tunes | Create custom models using your own data
Model listing | /v1/models | Retrieve available models

Table 4.1: OpenAI API endpoint overview table

These endpoints provide a robust toolkit for a variety of use cases, from building chatbots and summarizers to developing full-scale AI assistants.
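To make the endpoint shapes concrete, here is a minimal sketch that assembles the URL and JSON body for a /v1/chat/completions request without actually sending it. The build_chat_request helper and the API_BASE constant are our own illustrative names, not part of the OpenAI SDK:

```python
import json

# Illustrative only: helper and constant names are our own,
# not part of the official OpenAI SDK.
API_BASE = "https://api.openai.com/v1"

def build_chat_request(model, user_message,
                       system_message="You are a helpful assistant."):
    """Assemble the endpoint URL and JSON body for a chat completions call."""
    url = f"{API_BASE}/chat/completions"
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }
    return url, body

url, body = build_chat_request("gpt-4o", "Summarize RAG in one sentence.")
print(url)
print(json.dumps(body, indent=2))
```

Sending this body with an HTTP client (plus an Authorization header carrying your API key) is all that a raw call to the endpoint requires; the official SDKs wrap exactly this shape.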

Major OpenAI models

OpenAI offers several families of models, each designed for specific types of tasks. Here is a breakdown of the major models available as of 2025:

  • Text and chat models: These models are the most used and include:
    • GPT-3.5 family:
      • GPT-3.5 Turbo
    • GPT-4 family:
      • GPT-4
      • GPT-4o (omni multimodal)
      • GPT-4o-mini
      • GPT-4.1, GPT-4.1-mini, GPT-4.1-nano
    • O-series (reasoning focused):
      • o1, o1-mini, o1-pro
      • o3-mini, o3, o3-mini-high
      • o4-mini, o4-mini-high

    Each of these models offers different levels of performance, cost-efficiency, and reasoning capabilities, allowing developers to choose based on their specific application needs.

  • Image models: OpenAI has also developed powerful models for image generation:
    • DALL·E series:
      • DALL·E
      • DALL·E 2
      • DALL·E 3
    • GPT Image 1 (integrated with GPT-4o):

      These models allow users to generate and edit images based on textual descriptions, enabling a wide range of creative and practical applications.

  • Audio models:
    • Whisper: OpenAI's speech-to-text model, capable of transcribing and translating spoken language into text.
  • Video models:
    • Sora: A newer model for text-to-video generation, capable of producing short video clips from textual prompts.

Accessing OpenAI models

To use OpenAI models, developers typically perform the following steps:

1. Create an OpenAI account and obtain an API key.

2. Choose a model appropriate for their task.

3. Call the relevant endpoint using a programming language of their choice.

4. Integrate the responses into their application logic.

Here is an example in Python to list all available models:

import openai

openai.api_key = "your-api-key"

models = openai.Model.list()

for model in models.data:
    print(model.id)

This helps you dynamically fetch and utilize the models you have access to.

Choosing the right model

When building an application, the choice of model depends on multiple factors, like the following:

  • Performance requirements: Use GPT-4 or o-series for high reasoning needs.
  • Budget constraints: Consider using GPT-3.5 or smaller models like GPT-4o-mini.
  • Use case:
    • Text generation: GPT family
    • Reasoning-heavy tasks: O-series
    • Image generation: DALL·E
    • Transcription: Whisper
    • Video: Sora
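These selection criteria can be captured as a tiny routing helper. The mapping below is only a sketch based on the guidance above; the exact model ids you deploy should come from your own benchmarks and the current model list:

```python
# Illustrative routing table; the model ids and the mapping are our own
# assumptions, not an official OpenAI recommendation.
MODEL_BY_USE_CASE = {
    "text": "gpt-4o",
    "reasoning": "o3",
    "image": "dall-e-3",
    "transcription": "whisper-1",
    "video": "sora",
}

def choose_model(use_case, budget_sensitive=False):
    """Pick a model id for a use case; downgrade text work when budget matters."""
    if budget_sensitive and use_case == "text":
        return "gpt-4o-mini"
    return MODEL_BY_USE_CASE.get(use_case, "gpt-4o-mini")

print(choose_model("reasoning"))
print(choose_model("text", budget_sensitive=True))
```

Centralizing this choice in one function makes it trivial to re-route an application when new models ship or pricing changes.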

Best practices for beginners

If you are just getting started with OpenAI models, here are some best practices to help you build effectively and avoid common pitfalls:

  • Start with prebuilt models: Use the base GPT models before venturing into fine-tuning.
  • Monitor usage: OpenAI provides tools for monitoring token usage and costs.
  • Test extensively: Prompt engineering is key to performance. Try multiple prompt styles.
  • Stay updated: OpenAI frequently releases new models and improvements.

OpenAI offers a powerful ecosystem of models and APIs for building AI-enabled applications. With over 20 models across text, image, audio, and video modalities, the platform is robust enough to support a wide range of use cases. Whether you are building a cloud-based RAG system, a multimodal assistant, or an enterprise-level GenAI platform, understanding OpenAI's offerings is the first step toward creating impactful AI solutions.

By mastering the OpenAI API, developers unlock the ability to create intelligent, scalable, and future-ready applications that leverage some of the most advanced AI capabilities available today.

From OpenAI to agentic AI

As OpenAI's models have matured, their capabilities have expanded beyond generating text to performing multi-step reasoning and tool-based task execution. Initially known for models like GPT-3 and GPT-4, which excel at language understanding and generation, OpenAI has evolved its ecosystem to support more autonomous and interactive systems—paving the way for agentic AI.

Agentic AI represents a significant shift from passive text generation to active decision-making, tool use, and autonomous workflows. With the introduction of the Responses API and the Agents SDK, developers can now build intelligent agents capable of reasoning over tasks, invoking tools like web search or file retrieval, and orchestrating complex interactions with minimal intervention.

This transition reflects OpenAI’s broader mission to create systems that are not only intelligent but also useful, adaptive, and context-aware. Through frameworks like Operator (for browser tasks) and Codex (for software development), OpenAI enables agents that can act in the real world, not just simulate conversation.

The following section explores some of the details.

OpenAI’s agentic API ecosystem

OpenAI has introduced a powerful set of tools and APIs for building agent-based systems, collectively referred to as the agentic ecosystem. These interfaces are designed to support more complex and autonomous workflows where models can reason, invoke tools, perform tasks, and interact with digital environments in a structured manner. This section provides an overview of the core components of OpenAI’s agentic infrastructure, including the Responses API, the Agents SDK, Operator, and domain-specific agents like Codex.

Responses API

The Responses API, launched in early 2025, serves as OpenAI’s primary interface for building agentic applications. It extends the capabilities of the standard Chat Completions API by enabling a single API call to include not only textual reasoning but also tool invocation and stateful context management. Through the Responses API, developers can orchestrate interactions where the model performs tasks such as file lookups, web searches, or tool-based computations in a coherent sequence.

This API supports integrated reasoning and action loops, making it particularly useful for applications that require dynamic workflows. It is designed to eventually replace assistant APIs, providing a more streamlined and scalable foundation for agentic behavior.

Agents SDK

To support the development of complex workflows and multi-agent systems, OpenAI provides an official Agents SDK. Available in both Python and JavaScript/TypeScript, this SDK offers primitives such as agents, tools, workflows, guardrails, and handoffs. Developers can use the SDK to define agent logic, manage tool interactions, and coordinate actions across multiple AI agents.

The SDK facilitates features such as:

  • Tool invocation within reasoning loops
  • Guardrails for input/output validation
  • Multi-agent collaboration and task delegation
  • Native integration with tracing and evaluation tools

For example, using the Python SDK, an agent can be instantiated and executed with minimal setup:

from agents import Agent, Runner

agent = Agent(name="Assistant", instructions="You are a helpful assistant")

result = Runner.run_sync(agent, "Write a haiku about recursion")

print(result.final_output)

This abstraction allows developers to focus on the business logic while the SDK handles orchestration.

Operator

Operator is an autonomous agent developed by OpenAI for executing web-based tasks. Introduced in 2025, Operator allows AI systems to perform actions such as navigating websites, filling forms, and interacting with graphical user interfaces. It builds on the Responses API to bridge reasoning with real-world action, making it possible for agents to complete workflows that traditionally required human intervention.

This capability is particularly useful for use cases such as order placement, automated customer support, and form-driven workflows, where the agent needs to operate a browser-based interface.

Codex

Codex is OpenAI’s agentic AI system designed specifically for software development. Released in May 2025, Codex is capable of generating, debugging, and executing code. It extends beyond simple code generation by enabling agents to run tests, make edits, and interact with existing software systems to fulfill user-defined programming tasks.

Codex is accessible via OpenAI’s developer platform and as part of higher-tier subscription plans. It integrates seamlessly with the Responses API and Agents SDK, supporting use cases in software engineering, automation, and DevOps.

Assistants API

Prior to the Responses API, OpenAI provided the Assistants API (Legacy API) to facilitate tool-augmented conversations within a structured thread-based interface. While still available, this API is being phased out in favor of the more flexible and powerful Responses API. Developers are encouraged to transition to the new agentic stack, as future development and support will center around the Responses API and the Agents SDK.

Multi-document query

In earlier chapters, we focused on building systems that query a single document at a time, which is effective for narrow, well-defined tasks. However, real-world applications often require reasoning across multiple documents to gather context, compare information, or synthesize insights. Transitioning to a multi-document query approach allows our system to handle broader and more complex user intents. This shift involves rethinking how we chunk, embed, and retrieve information, ensuring relevance and coherence across diverse sources. In the following sections, we will explore strategies to support multi-document querying and how to integrate them into a scalable RAG pipeline.
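The core idea of metadata-aware multi-document retrieval can be sketched without any external libraries. The toy retriever below scores chunks by word overlap (a cheap stand-in for embedding similarity) and optionally filters on a source field, mirroring the source metadata used later in this chapter. All names and sample chunks are illustrative:

```python
def retrieve(chunks, query, source=None, k=2):
    """Rank (text, metadata) chunks by word overlap with the query,
    optionally keeping only chunks from one source document."""
    terms = set(query.lower().split())
    candidates = [c for c in chunks
                  if source is None or c["metadata"]["source"] == source]
    scored = sorted(candidates,
                    key=lambda c: len(terms & set(c["text"].lower().split())),
                    reverse=True)
    return scored[:k]

chunks = [
    {"text": "blockchain settlement reduces clearing time",
     "metadata": {"source": "finance.pdf"}},
    {"text": "solar and wind power capacity is growing",
     "metadata": {"source": "energy.pdf"}},
    {"text": "blockchain smart contracts automate payments",
     "metadata": {"source": "finance.pdf"}},
]
top = retrieve(chunks, "blockchain payments", source="finance.pdf", k=1)
print(top[0]["text"])  # blockchain smart contracts automate payments
```

In the real pipeline, the overlap score is replaced by vector similarity over embeddings, but the filter-then-rank structure stays the same.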

We will use the following code to generate multi-documents:

import requests
from reportlab.lib.pagesizes import LETTER
from reportlab.pdfgen import canvas
import textwrap
import os

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.2:3b-instruct-fp16"

def generate_text(topic, max_words=600):
    prompt = (
        f"Write an informative article about '{topic}' with approximately {max_words} words. "
        f"Structure the article with an introduction, body, and conclusion."
    )
    response = requests.post(OLLAMA_URL, json={
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False
    })
    if response.status_code == 200:
        return response.json()["response"].strip()
    else:
        raise Exception(f"Error: {response.status_code} - {response.text}")

def save_to_pdf(text, filename):
    pdf = canvas.Canvas(filename, pagesize=LETTER)
    width, height = LETTER
    margin = 50
    text_object = pdf.beginText(margin, height - margin)
    text_object.setFont("Times-Roman", 12)
    wrapped_lines = []
    for paragraph in text.split("\n"):
        wrapped_lines.extend(textwrap.wrap(paragraph, width=90))
        wrapped_lines.append("")
    for line in wrapped_lines:
        text_object.textLine(line)
        if text_object.getY() < margin:
            pdf.drawText(text_object)
            pdf.showPage()
            text_object = pdf.beginText(margin, height - margin)
            text_object.setFont("Times-Roman", 12)
    pdf.drawText(text_object)
    pdf.save()

if __name__ == "__main__":
    topics = [
        "The Future of Renewable Energy",
        "Benefits and Risks of Artificial General Intelligence",
        "How Blockchain is Transforming Financial Services",
        "The Importance of Mental Health Awareness",
        "Climate Change and Its Impact on Global Agriculture"
    ]
    os.makedirs("generated_articles", exist_ok=True)
    for topic in topics:
        try:
            print(f"Generating article on: {topic}")
            article = generate_text(topic)
            safe_title = topic.lower().replace(" ", "_").replace(",", "").replace(".", "")
            filename = f"generated_articles/{safe_title}.pdf"
            save_to_pdf(article, filename)
            print(f"PDF generated successfully: {filename}")
        except Exception as e:
            print(f"Failed to generate article for topic '{topic}': {str(e)}")

Implementing modular RAG with OpenAI

This section provides a detailed walkthrough of a modular RAG system. It combines OpenAI's GPT-4o for QA with Faiss as the vector database for efficient document retrieval. Each component is encapsulated in a separate module for clarity and maintainability.

The following figure illustrates the architecture of a metadata-aware multi-document RAG system using OpenAI's models for both embeddings and answer generation. It highlights how queries are processed through a hybrid retrieval mechanism that combines vector similarity with metadata filtering to ensure accurate, source-specific responses:

[Figure: A flowchart showing a query entering a vector database of embeddings, hybrid search with metadata filtering, and OpenAI generating results that cite the chunked and embedded source documents.]

Figure 4.1: Metadata-filtered hybrid RAG architecture using OpenAI

Main controller

This script is the user-facing interface of the RAG system. It loads the RAG chain and enters a continuous loop to accept user questions.

Upon receiving input, it invokes the RAG pipeline to retrieve relevant document chunks and generate a natural language response.

It prints the final answer as well as references to source documents used in the answer generation. This script ensures seamless interaction between the user and the system:

# main.py
from orchestrator.rag_chain import get_rag_chain

print("RAG System Ready. Type 'exit' to quit.")

invoke_rag_chain = get_rag_chain()

while True:
    query = input("\nUser: ")
    if query.lower() in ['exit', 'quit']:
        break
    result = invoke_rag_chain(query)
    print("\nAssistant:", result["answer"])
    print("\nSources:")
    for doc in result.get("source_documents", []):
        print("-", doc.metadata.get("source", "[unknown]"))

Configuration

This module centralizes key system constants such as model names, embedding identifiers, API keys, and file paths. It defines the location of source PDFs and the vector database, making it easy to adjust the system setup without editing core logic. By consolidating these values in a single location, the script ensures consistency and facilitates easier debugging and environment portability. It plays a foundational role in making the pipeline easily configurable and modular.

# config.py
MODEL_NAME = "gpt-4o"
EMBEDDING_MODEL = "text-embedding-3-small"
OPENAI_API_KEY = "your-api-key"
VECTOR_DB_PATH = "db"
SOURCE_DOCS = [
    "data/source_docs/ai_education_article.pdf",
    "data/source_docs/how_blockchain_is_transforming_financial_services.pdf"
]

Embedding initialization

This script initializes the OpenAI embedding model specified in the configuration. It serves as an abstraction layer to convert raw document text into dense numerical vector embeddings. These vectors are later used by the retrieval engine to identify relevant chunks semantically close to user queries. The module ensures that embedding generation is reusable, encapsulated, and easy to swap if the backend model changes.
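To see how such vectors drive retrieval, the short sketch below ranks toy chunk embeddings against a query embedding by cosine similarity, the measure vector stores typically use; the three-dimensional vectors are made up purely for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Tiny made-up embeddings; real models emit hundreds of dimensions.
query_vec = [0.9, 0.1, 0.0]
chunk_vecs = {
    "chunk_a": [0.8, 0.2, 0.1],
    "chunk_b": [0.0, 0.3, 0.9],
}
best = max(chunk_vecs, key=lambda name: cosine(query_vec, chunk_vecs[name]))
print(best)  # chunk_a
```

The retriever simply returns the chunks whose embeddings score highest against the query embedding.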

# embedder.py
from langchain_openai import OpenAIEmbeddings
from config import EMBEDDING_MODEL, OPENAI_API_KEY

def get_embedding_model():
    return OpenAIEmbeddings(
        model=EMBEDDING_MODEL,
        api_key=OPENAI_API_KEY
    )

Vector store setup

This module is responsible for managing the Faiss vector database. It first checks if a vector index already exists locally and loads it if available, avoiding redundant computation.

If the index does not exist, it generates vector embeddings from the document chunks and creates a new Faiss index.

This setup supports persistence and fast retrieval for downstream search components in the RAG pipeline.
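The load-if-present-else-build-and-save pattern used here can be sketched in plain Python with pickle, independent of Faiss. This is only an illustration of the caching logic, not of the actual Faiss index format:

```python
import os
import pickle
import tempfile

def get_index(path, build_fn):
    """Load a cached index from disk if present; otherwise build and persist it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    index = build_fn()
    with open(path, "wb") as f:
        pickle.dump(index, f)
    return index

cache = os.path.join(tempfile.mkdtemp(), "index.pkl")
first = get_index(cache, lambda: {"doc1": [0.1, 0.2]})   # builds and saves
second = get_index(cache, lambda: {"never": "called"})   # loads the cached copy
print(second)
```

The second call never invokes its build function, which is exactly why the Faiss handler below avoids recomputing embeddings on restart.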

# db_handler.py
import os
from pathlib import Path
from langchain_community.vectorstores import FAISS
from embeddings.embedder import get_embedding_model
from config import VECTOR_DB_PATH

def get_vectorstore(documents):
    embedding_model = get_embedding_model()
    index_file = Path(VECTOR_DB_PATH) / "index.faiss"
    store_file = Path(VECTOR_DB_PATH) / "index.pkl"
    if index_file.exists() and store_file.exists():
        return FAISS.load_local(
            VECTOR_DB_PATH,
            embedding_model,
            allow_dangerous_deserialization=True
        )
    vectorstore = FAISS.from_documents(
        documents,
        embedding=embedding_model
    )
    vectorstore.save_local(VECTOR_DB_PATH)
    return vectorstore

Metadata tagging

This utility script enriches each document chunk with metadata that tracks its source filename. This metadata is later used to provide attribution and filtering during retrieval and response generation. By tagging the origin of each text chunk, the system can ensure transparency, traceability, and explainability in RAG responses.

This metadata also supports topic-specific filtering and improves user trust by surfacing source information.

#metadata_schema.py

def add_metadata_to_chunks(chunks, source_name):
    for chunk in chunks:
        if not chunk.metadata:
            chunk.metadata = {}
        chunk.metadata["source"] = source_name
    return chunks

Document loading and chunking

This module handles the ingestion and preprocessing of source PDF documents. It loads each file, extracts the raw text, and then splits the content into overlapping, semantically relevant chunks. Each chunk is further enriched with metadata such as the source filename, enabling better traceability during retrieval. This modular approach prepares the documents for embedding and retrieval while supporting flexible document management.

#pdf_parser.py

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from config import SOURCE_DOCS
from vectorstore.metadata_schema import add_metadata_to_chunks
import os

def load_and_chunk_pdfs():
    all_chunks = []
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", " ", ""]
    )
    for path in SOURCE_DOCS:
        loader = PyPDFLoader(path)
        documents = loader.load()
        chunks = splitter.split_documents(documents)
        source_name = os.path.basename(path)
        enriched_chunks = add_metadata_to_chunks(chunks, source_name)
        all_chunks.extend(enriched_chunks)
    return all_chunks

Hybrid retriever

This module filters chunks based on topic relevance and combines BM25 and vector retrieval for improved accuracy. A filtering step based on keyword-topic mapping dynamically restricts the chunks by topic before the BM25 and vector retrievers are created.

Enforce metadata-based filtering during retrieval

This is done by modifying the retriever logic to filter documents by metadata before scoring and combining them. The following shows how to achieve it modularly:

#hybrid_search.py

from langchain.retrievers import BM25Retriever, EnsembleRetriever

def filter_chunks_by_topic(chunks, topic):
    # Guard against topic=None (the default in get_hybrid_retriever).
    if not topic:
        return chunks
    topic = topic.lower()
    if "blockchain" in topic or "crypto" in topic:
        return [c for c in chunks if "blockchain" in c.metadata.get("source", "").lower()]
    elif "education" in topic or "ai" in topic or "artificial intelligence" in topic:
        return [c for c in chunks if "education" in c.metadata.get("source", "").lower()]
    else:
        return chunks

def get_hybrid_retriever(chunks, vectorstore, topic=None):
    filtered_chunks = filter_chunks_by_topic(chunks, topic)
    bm25_retriever = BM25Retriever.from_documents(filtered_chunks)
    bm25_retriever.k = 4
    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    return EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.5, 0.5]
    )

Language model

This module initializes the core LLM used for response generation, such as OpenAI's GPT-4o, based on configuration settings. It wraps the model inside LangChain’s abstraction to allow easy integration with retrieval chains and memory components. The model is configured with a low temperature to favor deterministic, informative answers. This component serves as the generative backbone of the RAG system, producing human-like responses from the retrieved content.

#generate.py

from langchain_openai import ChatOpenAI
from config import MODEL_NAME, OPENAI_API_KEY

def get_llm():
    return ChatOpenAI(
        model=MODEL_NAME,
        temperature=0.2,
        api_key=OPENAI_API_KEY
    )

Prompt template

This module defines the structured prompt that instructs the LLM to follow the reasoning and acting (ReAct) paradigm. It encourages the model to first list intermediate reasoning steps before providing a final answer. This improves interpretability, reduces hallucination, and ensures the model aligns its reasoning with the retrieved context. It enables a more transparent and auditable answer generation process in complex query scenarios.

#react_prompt.py

from langchain.prompts import PromptTemplate

react_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an intelligent assistant using the ReAct (Reasoning + Acting) technique.
Break down the user query into reasoning steps and retrieve relevant information accordingly.

Question: {question}

Relevant Context:
{context}

First, list your reasoning steps clearly.
Then, provide a final answer based on those steps and the retrieved context.

Reasoning Steps:
1.
"""
)

RAG chain assembly

This is the orchestration layer that wires together all components in the RAG system. It loads and preprocesses documents, builds or loads the vector store, configures the LLM and retriever, and binds them into a unified pipeline. It dynamically builds a hybrid retriever based on the user’s query to enhance retrieval relevance. The function returns a callable interface that processes user questions end-to-end, generating high-quality answers with source traceability.

#rag_chain.py

from utils.pdf_parser import load_and_chunk_pdfs
from vectorstore.db_handler import get_vectorstore
from retriever.hybrid_search import get_hybrid_retriever
from llm.generate import get_llm
from memory.conversation_buffer import memory
from llm.react_prompt import react_prompt
from langchain.chains import ConversationalRetrievalChain

def get_rag_chain():
    chunks = load_and_chunk_pdfs()
    vectorstore = get_vectorstore(chunks)
    llm = get_llm()

    def invoke_rag_chain(query: str):
        hybrid_retriever = get_hybrid_retriever(chunks, vectorstore, topic=query)
        rag = ConversationalRetrievalChain.from_llm(
            llm=llm,
            retriever=hybrid_retriever,
            memory=memory,
            return_source_documents=True,
            combine_docs_chain_kwargs={"prompt": react_prompt},
            output_key="answer"
        )
        return rag.invoke({"question": query})

    return invoke_rag_chain

Conversational memory

This module configures a memory buffer to retain past user queries and assistant responses across multiple turns. It enables the system to carry forward conversational context, making follow-up questions more coherent and contextually aware. By storing interaction history, it transforms the assistant into a truly interactive and conversational agent. This is critical for maintaining continuity in user sessions and improving the overall user experience.

#conversation_buffer.py

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)

Dependencies

This file lists all required Python packages needed to install and run the RAG system. It includes LangChain modules, OpenAI SDKs, vector database libraries like Faiss, and PDF processing tools. The file allows easy environment bootstrapping using pip install -r requirements.txt. Maintaining this file ensures reproducibility, portability, and collaboration across teams or deployments.

#requirements.txt

langchain
langchain-community
langchain-openai
faiss-cpu
reportlab
rank_bm25
pypdf
openai

Use the following command to install them:

pip install -r requirements.txt

This modular architecture promotes scalability, maintainability, and reusability. Each component has a single responsibility, making it easier to swap out models, change the retrieval mechanism, or update the document pipeline as needed.

You can confidently run your RAG app and expect this:

  • Asking how does blockchain improve payments? | pulls only from the blockchain PDF.
  • Asking how can AI personalize learning? | pulls only from the AI education PDF.

To do

In our current implementation, the RAG system is single-tenant, meaning it handles all data and user interactions within a single shared environment. All source documents are embedded into a single vector store, and the retrieval process operates across the same shared document index regardless of which user submits a query.

In contrast, a multi-tenant RAG system must enforce data isolation between tenants, organizations, departments, or individual users. Each tenant would have its own isolated vector store or a namespace within a shared store, ensuring that one user’s data and results are never exposed to another. The system must dynamically load the correct vector store and memory context based on the tenant identity during each query.

Using the existing code, available in the GitHub repo of this book, analyze what specific architectural changes would be required to transform this single-tenant RAG into a secure, scalable multi-tenant system. Focus on vector store separation, memory handling, and request-level routing. Optionally, mention how user authentication or metadata tagging could support these changes.

Conclusion

This chapter offered a concise yet thorough guide to building advanced RAG systems using OpenAI technologies. We began with core concepts and APIs that enable seamless integration of language models into real-world use cases. We then explored the evolution toward agentic AI, autonomous systems capable of reasoning and executing tasks, which marks a shift from static interactions to dynamic, adaptive workflows.

A key focus was multi-document querying, essential for aggregating context from diverse sources. We presented a modular, scalable RAG architecture that combines OpenAI models with Faiss for hybrid retrieval, enabling high relevance, flexibility, and enterprise-grade performance.

In the next chapter, we will explore agentic GenAI with human-AI interaction. It will guide readers through building decision-aware agents that retrieve, reason, act, and interact. This includes integrating tool use, feedback loops, and multi-agent collaboration, extending RAG into dynamic, interactive systems with human oversight.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 5
Implementing Agentic GenAI Systems with Human-in-the-loop

Introduction

As generative AI (GenAI) continues to evolve beyond simple query-response paradigms, agentic GenAI emerges as a powerful architectural approach that enables structured, dynamic, and autonomous reasoning. Unlike traditional models that respond in a single step, agentic GenAI systems are designed to plan, retrieve information, utilize tools, and make decisions across multiple reasoning steps. This chapter introduces readers to the foundational concepts of building such systems, focusing on modular, extensible, and multi-agent architectures. Drawing on real-world patterns, from sequential agents to hierarchical planners, this chapter provides a comprehensive guide to engineering agents that think and act like orchestrators.

You will learn how to use tools like LangChain’s ReAct framework, LangGraph, and retrieval components to implement intelligent multi-agent systems. These agents can interact with APIs, query vector databases, utilize memory, and even collaborate with human-in-the-loop (HITL). Visual frameworks such as aggregator, loop, and router patterns will be mapped to code using Python, giving you practical insight into how these abstract ideas are realized. By mastering these agentic patterns and design principles, you will gain the ability to develop AI systems that do not just generate, but reason, retrieve, and respond with purpose.

Structure

In this chapter, we will learn about the following topics:

  • Architecting agentic GenAI systems
  • End-to-end human-in-the-loop RAG workflow
  • From HITL to multi-agent human-in-the-loop RAG
  • Agentic AI vs. AI agents

Objectives

The objective of this chapter is to equip readers with a deep understanding of agentic GenAI systems by exploring their architecture, design patterns, and practical implementations. Readers will learn how to build multi-agent workflows that enable reasoning, tool use, memory integration, and collaboration. The chapter also introduces HITL retrieval-augmented generation (RAG) systems and contrasts traditional AI agents with agentic AI, emphasizing orchestration and adaptive planning. By the end, readers will be able to design intelligent systems that move beyond single-step responses, laying the foundation for scalable, autonomous AI applications in dynamic, real-world environments.

Architecting agentic GenAI systems

In Chapter 2, Deep Dive into Multimodal Systems, we introduced the concept of multi-agent systems—systems where autonomous AI agents collaborate to solve complex tasks. In this section, we explore the design patterns that form the backbone of such systems. These patterns are essential for building intelligent, modular, and scalable GenAI applications. By understanding and applying them, developers can move beyond one-shot generation models and architect truly dynamic, agentic systems capable of planning, reasoning, retrieving, acting, and learning.

Multi-agent systems represent a significant shift from monolithic AI systems to distributed, interactive architectures. Each agent in these systems can be specialized, autonomous, or interdependent, contributing to sophisticated workflows through shared memory, tools, and reasoning paths. In practice, these systems are built by combining reusable design patterns that define how agents interact with one another and the environment. In the following sections, we examine both classical and advanced patterns, ranging from simple sequential flows to collaborative, fault-tolerant, and multimodal reasoning systems.

Parallel pattern

The parallel pattern structures multiple AI agents to operate concurrently on either the same input or different components of a larger input. Each agent performs its task independently, without being influenced by others, and the final result is obtained by merging or aggregating their individual outputs.

Structure and behavior: All agents are triggered simultaneously. They may use the same input (e.g., a shared user prompt) or segmented parts (e.g., split documents). After processing, a merge function aggregates results into a unified output.

Design rationale: This pattern is particularly effective when tasks can be decomposed into independent units of work. It maximizes speed through parallelism and can exploit specialization among agents.

Practical application:

  • Running different summarization techniques on a document and selecting the best one.
  • Performing multi-language translation simultaneously.
  • Running sentiment, intent, and topic analysis in parallel on the same message.

The following figure illustrates a parallel agentic orchestration workflow where a central LLM-based orchestrator distributes input tasks to specialized agents for parallel processing before generating the final output:

Figure 5.1: LLM orchestrator routes input to agents for collaborative output
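
As a minimal, framework-free sketch, the parallel pattern can be expressed with Python's concurrent.futures. The stub agents below (sentiment, topic, and word-count analysis of one message) are illustrative stand-ins, not code from the book's repository:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub agents: each analyzes the same input independently.
def sentiment_agent(text):
    return {"sentiment": "positive" if "good" in text else "neutral"}

def topic_agent(text):
    return {"topic": "payments" if "payment" in text.lower() else "general"}

def length_agent(text):
    return {"length": len(text.split())}

def run_parallel(text, agents):
    # Trigger all agents concurrently, then merge their outputs.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda agent: agent(text), agents)
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged

print(run_parallel("good payment experience",
                   [sentiment_agent, topic_agent, length_agent]))
```

In a real system the stubs would be LLM calls; the merge loop plays the role of the aggregation step shown in Figure 5.1.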

Sequential pattern

The sequential pattern connects agents in a pipeline, where each agent's output becomes the next agent's input. This creates a multi-step reasoning or transformation process.

Structure and behavior: Agent A | Agent B | Agent C, forming a clear, ordered chain of execution. Each step builds on the last, often increasing abstraction or refining output.

Design rationale: Useful for workflows requiring layered processing. Each agent can perform a simple task, resulting in manageable, testable, and interpretable steps.

Practical application:

  • An agent retrieves data, another summarizes it, and a third formats it.
  • Text generation followed by grammar correction and tone adjustment.
  • Data extraction | classification | storage.

This figure shows a sequential agent collaboration pattern where an LLM orchestrator routes input through one agent, which delegates part of the task to another agent before producing the final output:

Figure 5.2: Chained agent collaboration with LLM orchestration
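
The chain above can be sketched in a few lines of plain Python; the retrieve/summarize/format stubs are hypothetical agents used only to show the data flow:

```python
# Stub agents: each output feeds the next agent in the chain.
def retrieve_agent(query):
    return f"facts about {query}"

def summarize_agent(text):
    return f"summary({text})"

def format_agent(text):
    return text.upper()

def run_pipeline(query, agents):
    # Agent A | Agent B | Agent C: an ordered chain of execution.
    result = query
    for agent in agents:
        result = agent(result)
    return result

print(run_pipeline("blockchain", [retrieve_agent, summarize_agent, format_agent]))
# → SUMMARY(FACTS ABOUT BLOCKCHAIN)
```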

Loop pattern

In the loop pattern, agents iteratively process an input through a feedback mechanism. The system re-evaluates or refines results in each loop cycle, continuing until a convergence condition is met.

Structure and behavior: Agent A produces output | Agent B evaluates | feedback is returned | repeat until quality threshold is met or loop count ends.

Design rationale: Ideal for tasks involving iterative improvement, optimization, or learning from feedback. Encourages refinement over one-shot generation.

Practical application:

  • A writing assistant loops through draft revisions based on critique.
  • Generative agent creates responses, while an evaluation agent provides quality feedback.
  • Code generation and testing loop until no bugs are found.

This figure represents a looped agent interaction, where agents collaboratively refine results through iterative communication before producing the final output, all under the direction of an LLM orchestrator:

Figure 5.3: LLM-guided agent loop for iterative task solving
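
The generate-evaluate-refine cycle can be sketched as follows; the generator and the scoring logic are deliberately trivial placeholders for real LLM calls:

```python
def generator_agent(draft, feedback):
    # Revise the draft whenever feedback is present (stub revision).
    return draft + " (revised)" if feedback else draft

def evaluator_agent(draft):
    # Stub quality score: grows with each revision pass.
    return draft.count("(revised)")

def refine_loop(draft, threshold=2, max_iters=5):
    feedback = None
    for _ in range(max_iters):
        draft = generator_agent(draft, feedback)
        score = evaluator_agent(draft)
        if score >= threshold:          # convergence condition met
            break
        feedback = "needs improvement"  # fed back into the next cycle
    return draft, score

final_draft, final_score = refine_loop("initial draft")
print(final_draft, final_score)
```

The max_iters cap mirrors the "loop count ends" exit condition described above, preventing an unbounded refinement loop.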

Router pattern

This pattern introduces a central router agent that dynamically decides which downstream agent should handle an incoming task based on content, context, or metadata.

Structure and behavior: Router receives input | classifies or analyzes | sends to one of many specialized agents | result is returned.

Design rationale: Supports modularity and conditional logic. By separating decision-making from task execution, it promotes reusability and system flexibility.

Practical application:

  • Routing finance queries to tax agents, budgeting agents, or investment agents.
  • Multimodal input classification (text/image/audio) followed by specialized processing.
  • Helpdesk agent routing to technical support or billing systems.

The following architecture introduces a router agent between the LLM orchestrator and downstream agents, enabling smart task delegation based on input characteristics:

Figure 5.4: Router agent directs tasks to specialized agents for output generation
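
A minimal sketch of the finance-routing example from the list above; the keyword classifier is a stand-in for what would normally be an LLM-based intent classifier:

```python
# Specialized stub agents for different finance domains.
def tax_agent(query):
    return "tax specialist handled: " + query

def budgeting_agent(query):
    return "budgeting specialist handled: " + query

def investment_agent(query):
    return "investment specialist handled: " + query

ROUTES = {
    "tax": tax_agent,
    "budget": budgeting_agent,
    "invest": investment_agent,
}

def router_agent(query):
    # Classify the input (here: by keyword) and dispatch to one specialist.
    for keyword, agent in ROUTES.items():
        if keyword in query.lower():
            return agent(query)
    return "no specialist matched: " + query

print(router_agent("How do I budget next quarter?"))
```

Because decision-making lives only in router_agent, new specialists can be added by extending ROUTES without touching any existing agent.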

Aggregator pattern

The aggregator pattern combines inputs or outputs from multiple sources into a coherent result. It focuses not on parallel execution but on synthesis and consolidation.

Structure and behavior: Multiple inputs | aggregator agent | normalizes, merges, or summarizes data | returns single output.

Design rationale: Useful when diverse perspectives or data sources are required for a comprehensive output. Promotes robustness through redundancy.

Practical application:

  • Combining answers from different knowledge bases to form a complete response.
  • Merging various metric outputs into a single dashboard summary.
  • Voting or ensemble models for decision-making.

The following figure depicts an aggregator pattern, where multiple inputs are unified by an LLM-based orchestrator before being passed to an agent for final processing:

Figure 5.5: Orchestrator aggregates multiple inputs for unified agent execution
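
The normalize-merge-return flow can be sketched as a single aggregator function; here normalization is simple lowercasing and deduplication, a stand-in for whatever reconciliation a real system needs:

```python
def aggregator_agent(answers):
    # Normalize, deduplicate, and merge answers from multiple sources
    # into a single consolidated output.
    seen, merged = set(), []
    for answer in answers:
        normalized = answer.strip().lower()
        if normalized and normalized not in seen:
            seen.add(normalized)
            merged.append(normalized)
    return "; ".join(merged)

print(aggregator_agent(["Paris ", "paris", "Capital of France"]))
# → paris; capital of france
```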

Network pattern

In this pattern, agents are fully or partially connected and communicate freely without centralized control. It reflects a decentralized, mesh-like topology.

Structure and behavior: Agents exchange messages with any peer in the network, forming an open, collaborative environment. Coordination is emergent.

Design rationale: Best for complex environments requiring autonomy, adaptability, and peer learning. Suited for distributed problem-solving.

Practical application:

  • AI agents representing stakeholders negotiating a contract.
  • Autonomous driving simulations with vehicle-to-vehicle communication.
  • Distributed consensus on data classification in peer systems.

The following figure illustrates a network pattern where an LLM orchestrator activates a collaborative mesh or network of agents, each contributing to and building upon one another’s outputs:

Figure 5.6: Networked agents collaborate for enriched output generation
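
A bare-bones sketch of the mesh topology: every peer can message every other peer directly, with no central orchestrator mediating the exchange. The PeerAgent class and build_mesh helper are illustrative names, not part of any framework:

```python
class PeerAgent:
    # A peer in the mesh: talks directly to any other peer.
    def __init__(self, name):
        self.name = name
        self.inbox = []
        self.peers = []

    def broadcast(self, message):
        for peer in self.peers:
            peer.inbox.append((self.name, message))

def build_mesh(names):
    # Fully connect every agent to every other agent.
    agents = [PeerAgent(name) for name in names]
    for agent in agents:
        agent.peers = [other for other in agents if other is not agent]
    return agents

alice, bob, carol = build_mesh(["alice", "bob", "carol"])
alice.broadcast("proposal: classify item X as category A")
print(bob.inbox, carol.inbox)
```

Real mesh systems would add reply handling and a consensus rule on top of this broadcast primitive.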

Hierarchical pattern

The hierarchical pattern organizes agents into layers of abstraction. High-level agents (planners or supervisors) delegate tasks to mid or low-level agents, who execute them.

Structure and behavior: Planner agent | task delegation | worker agents | aggregated output returned up the hierarchy.

Design rationale: Encourages clarity of responsibility and control. Supports task decomposition and team-like collaboration.

Practical application:

  • A chatbot manager delegates to knowledge retrieval, formatting, and validation sub-agents.
  • Project agent overseeing multiple specialized subprocesses (e.g., translation, summarization, citation).

This architecture represents a hierarchical pattern where an orchestrator routes input through a coordinating agent, which then delegates tasks to specialized agents for parallel outputs:

Figure 5.7: Hierarchical agent chain for distributed output generation
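
The planner/worker split can be sketched as follows; PlannerAgent and the two worker stubs are hypothetical names used only to illustrate delegation and aggregation up the hierarchy:

```python
# Low-level worker stubs.
def translation_agent(doc):
    return f"translated:{doc}"

def summarization_agent(doc):
    return f"summarized:{doc}"

class PlannerAgent:
    # High-level planner: decomposes the task, delegates to workers,
    # and aggregates their outputs back up the hierarchy.
    def __init__(self, workers):
        self.workers = workers

    def run(self, document):
        return {name: worker(document) for name, worker in self.workers.items()}

planner = PlannerAgent({
    "translate": translation_agent,
    "summarize": summarization_agent,
})
print(planner.run("quarterly report"))
```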

Human-in-the-loop pattern

This pattern introduces human decision-making into the system at critical junctures, allowing agents to pause and await user input or validation before continuing.

Structure and behavior: Agent execution halts | human reviews or provides input | execution resumes.

Design rationale: Essential in sensitive domains (legal, healthcare, ethics) where human oversight is required for safety, correctness, or regulation.

Practical application:

  • AI drafting a contract clause, then awaiting human approval.
  • Human moderation of flagged content.
  • Escalating unresolved chatbot queries to human agents.

This figure highlights a HITL agent framework, where the LLM-orchestrated agents generate multiple outputs and the human provides oversight and final judgment:

Figure 5.8: LLM-orchestrated agents generate outputs with human oversight and final judgment
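
The halt-review-resume cycle can be sketched by injecting the human review as a callable; in practice review_fn would be backed by a UI prompt or an approval queue rather than a Python function:

```python
def drafting_agent(request):
    return f"draft clause for {request}"

def run_with_human(request, review_fn, max_rounds=3):
    # Execution pauses after each draft; review_fn stands in for the
    # human reviewer who approves or requests rework.
    for _ in range(max_rounds):
        draft = drafting_agent(request)
        verdict = review_fn(draft)
        if verdict == "approve":
            return draft              # execution resumes with approval
        request = request + " (rework)"
    return None                       # escalate if never approved
```

The max_rounds cap models the escalation path from the bullet list: if the human never approves, the task leaves the automated loop entirely.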

共享工具模式

Shared tools pattern

多个代理访问通用工具包(例如 API、搜索引擎或向量数据库),以保持各项任务的一致性和效率。

Multiple agents access a common toolkit, such as APIs, search engines, or vector databases, to maintain consistency and efficiency across tasks.

结构和行为:代理 | 共享接口层 | 工具/数据库/API。

Structure and behavior: Agents | shared interface layer | tool/database/API.

设计理念:提高模块化程度,减少重复工作。支持集中式更新和监控。

Design rationale: Promotes modularity and reduces duplication. Allows centralized updates and monitoring.

实际应用

Practical application:

  • 代理程序针对不同的用户查询,查询同一个QA数据库。
  • Agents querying a single QA database for different user queries.
  • 所有代理均可访问的共享缓存或内存系统。
  • Shared cache or memory system accessed by all agents.
  • 统一知识图谱作为共享后端。
  • Unified knowledge graph as a shared backend.

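The shared interface layer can be reduced to a small registry that every agent calls instead of owning its own tool instances. The sketch below is illustrative only; `ToolRegistry` and the sample QA tool are hypothetical names, not part of any specific framework:

```python
class ToolRegistry:
    """Single shared interface layer: agents resolve tools by name."""

    def __init__(self):
        self._tools = {}

    def register(self, name, fn):
        self._tools[name] = fn

    def call(self, name, *args, **kwargs):
        # Every agent goes through this one entry point, so tool updates
        # and monitoring can be centralized here.
        return self._tools[name](*args, **kwargs)


registry = ToolRegistry()
registry.register("qa_db", lambda q: f"answer for: {q}")

# Two different agents share the same QA database tool.
agent_a = registry.call("qa_db", "What is RAG?")
agent_b = registry.call("qa_db", "What is HITL?")
```

Because both agents route through `registry.call`, swapping the backing database or adding logging is a one-line change rather than a per-agent edit.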
This architecture demonstrates an advanced HITL system where agents not only collaborate but also use shared tools to refine their outputs before human review.

The figure shows a person labeled as the input sending data to an orchestrator bot, which delegates tasks to three agent bots; one agent sends its output directly, another involves a human, and all agents return outputs.

Figure 5.9: Shared tool-augmented agents collaborate under human supervision for optimized output

Database with tools pattern

This pattern surrounds agents with tools and databases that provide, enrich, or persist knowledge in real-time, aiding intelligent decision-making.

Structure and behavior: Agent processes | external tool provides transformation | data stored or fed forward.

Design rationale: Combines computation with structured persistence. Supports complex workflows involving state tracking and enrichment.

Practical application:

  • Using a web scraping tool to enrich results before storage in a vector database.
  • Data extraction | sentiment enrichment | save to analytics store.

This architecture showcases agents enhanced with access to external tools like vector databases and web search engines, all coordinated through an LLM orchestrator and guided by human feedback:

The flowchart shows input arriving at the orchestrator and then at the agent (LLM). The agent branches to vector search, web search, or a human, and the outputs of all paths return to the orchestrator and finally reach the output.

Figure 5.10: Agents use vector and web search tools to generate enriched, human-verified output

Memory transformation using tools

In this pattern, agents update memory based on processed insights from tools, enabling learning and personalization across sessions.

Structure and behavior: Agent or tool extracts signal | memory module is updated | future decisions influenced by memory state.

Design rationale: Supports adaptive systems that learn from history, preferences, and interactions over time.

Practical application:

  • Chatbots adjusting tone based on prior conversations.
  • Emotional intelligence agents updating user sentiment profiles.
  • Personalized product recommendations based on user behavior history.

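Let us sketch the pattern in miniature, with a plain dict standing in for a persistent memory store and a hypothetical tone signal standing in for a tool-extracted insight:

```python
# Persistent memory shared across sessions (a dict stands in for a real store).
memory = {"tone": "neutral"}

def update_memory(signal):
    # A tool-extracted insight (e.g., detected user sentiment) updates memory.
    memory["tone"] = signal

def respond(text):
    # Future decisions are influenced by the current memory state.
    prefix = {"friendly": "Hey!", "neutral": "Hello."}[memory["tone"]]
    return f"{prefix} {text}"

first = respond("How can I help?")   # uses the default neutral tone
update_memory("friendly")            # a prior conversation shifts the profile
second = respond("How can I help?")  # tone is now adapted
```

The key point is the separation of concerns: the tool writes the signal, the memory persists it, and every later decision reads it.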
This figure illustrates an AI orchestration framework where a user's input is managed by an orchestrator, which uses language models and specialized agents to conduct tasks like vector search, web search, and memory retrieval, facilitating iterative, HITL outputs:

The flowchart shows input reaching the orchestrator, which directs agents to perform vector search, web search, or consult a human, with all outputs fed back into memory or further agent actions.

Figure 5.11: AI orchestration architecture

Planner-executor pattern

This pattern divides the system into a planning agent that determines the strategy and one or more executors that carry out actions based on the plan.

Structure and behavior: Planner reasons over goal | forms a plan | executors act step-by-step | feedback returned to planner.

Design rationale: Mimics human cognition (thinking before acting). Enables complex, multi-step reasoning and traceable execution.

Practical application:

  • Research agent planning steps for report creation.
  • Code generation with pre-planned logic blocks.
  • Multi-hop QA with tool use between reasoning steps.

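The think-then-act split can be sketched in a few lines. Here `planner` and `executor` are toy stand-ins for LLM-backed agents, and the three-step plan is an illustrative assumption:

```python
def planner(goal):
    # The planner reasons over the goal and emits an ordered plan.
    return [f"research {goal}", f"outline {goal}", f"draft {goal}"]

def executor(step):
    # Each executor carries out one step and reports back.
    return f"done: {step}"

def run(goal):
    results = []
    for step in planner(goal):
        result = executor(step)
        results.append(result)  # feedback could revise the plan here
    return results
```

In a real system the loop body is where feedback flows back to the planner, allowing the remaining steps to be re-planned mid-execution.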
The following figure demonstrates a multi-agent AI workflow, showing how user input is routed via an orchestrator and planner agent to specialized agents for tasks such as vector search and web search, with outputs feeding into memory for iterative improvement:

The flowchart shows input passing through the LLM to the orchestrator and planner agents, which delegate to multiple agents. The agents perform vector search, web search, and memory access, feeding their outputs back into the system.

Figure 5.12: AI multi-agent workflow

Critic or validator pattern

This pattern includes a validator or critic agent that reviews and either approves or requests revisions of another agent’s output.

Structure and behavior: Producer agent | output reviewed by validator | approved or revised | final output.

Design rationale: Improves reliability, reduces hallucinations, and provides quality control. Acts as an internal feedback loop.

Practical application:

  • Code suggestion reviewed by a test or lint agent.
  • AI writing assistant critiqued for tone or clarity.
  • Fact-checker agent validating generated responses.

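The producer-validator loop can be sketched as follows. Both functions are placeholders for LLM calls, and the citation check is an assumed stand-in for a real quality critique:

```python
def producer(prompt, feedback=None):
    # Stand-in for an LLM call; a critique nudges the next draft.
    draft = f"answer to {prompt}"
    return draft + " (with sources)" if feedback else draft

def critic(draft):
    # Stand-in quality check: require citations in the final output.
    return "(with sources)" in draft

def produce_with_validation(prompt, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        draft = producer(prompt, feedback)
        if critic(draft):
            return draft            # approved by the validator
        feedback = "add citations"  # critique fed back to the producer
    return None                     # rejected after all rounds
```

The bounded loop matters: an uncapped critic-producer cycle can oscillate forever, so production systems cap revisions and fall back or escalate.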
The following figure illustrates an agent-based AI workflow featuring an orchestrator that processes input and delegates tasks to agents through a critic agent, ensuring quality via feedback loops and agent collaboration:

The flowchart shows input passing to the orchestrator, which sends data to the critic agent. The critic agent chooses between two agents (one path marked with a green tick, the other with a red cross) before reaching the output.

Figure 5.13: Agent-based AI workflow with critic-mediated task execution and iterative agent feedback

Negotiator pattern

Agents with differing goals or perspectives communicate iteratively to reach a decision or resolution. This simulates negotiation, compromise, or game-theoretic behavior.

Structure and behavior: Agents exchange offers or proposals | state evolves based on preferences | agreement or failure.

Design rationale: Models real-world stakeholder interaction. Useful in simulations or distributed decision systems.

Practical application:

  • AI agents representing the buyer and seller negotiating pricing.
  • Team of agents optimizing trade-offs in product design.
  • Resource allocation across competing AI subsystems.

This figure depicts a negotiation workflow among AI agents. The negotiator agent issues two signals to each agent: the top agent declines the first signal but accepts the second, while the bottom agent declines both signals. This selective signaling ultimately results in only the top agent contributing to the final output:

The figure shows the orchestrator sending input to the negotiator agent, which interacts with three agent icons. Two agents are marked with red X marks and one with a green tick. Arrows point to the output on the right.

Figure 5.14: Selective output

Multimodal agent pattern

This pattern uses multiple agents to process different types of input or output (e.g., text, image, audio), and combines their insights into a unified result.

Structure and behavior: Input routed based on modality | modality-specific processing | fusion agent combines results.

Design rationale: Enables multi-sensory AI systems that can reason across formats and deliver richer insights.

Practical application:

  • AI assistant processes the image and caption, then summarizes in natural language.
  • Video-to-text transcription followed by semantic analysis.

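Modality routing and fusion can be sketched with a handler table. The handlers below are placeholders for real captioning and summarization models, and the string outputs are purely illustrative:

```python
def route_and_fuse(inputs):
    # Modality-specific agents (placeholders for real models).
    handlers = {
        "text": lambda payload: f"summary({payload})",
        "image": lambda payload: f"caption({payload})",
    }
    # Route each input by modality, then let a fusion step merge the results.
    partial = [handlers[kind](payload) for kind, payload in inputs]
    return " | ".join(partial)

fused = route_and_fuse([("image", "cat.png"), ("text", "a cat photo")])
```

Adding a new modality means registering one more handler; the fusion step stays untouched, which is the main appeal of the pattern.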
The following figure visualizes a coordinated AI system where an orchestrator leverages language models to route user-provided text or images to specialized agents, whose processed outputs are merged for a unified result:

The flowchart shows input text and input images arriving at the orchestrator, which sends the text and images to separate agents for text and image processing before merging the outputs into a single result.

Figure 5.15: Combined output

Voting or consensus pattern

Multiple agents offer answers, and a final result is chosen based on consensus, confidence, or voting algorithms.

Structure and behavior: Agents process in parallel | submit predictions or evaluations | aggregator computes best result.

Design rationale: Boosts reliability and robustness. Reduces bias from a single source of truth.

Practical application:

  • Crowd-sourced labeler agents for data annotation.
  • Model ensemble voting for classification tasks.
  • Redundant summarizers produce a majority result.

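Majority voting over redundant agent outputs reduces to a counter; in this minimal sketch, ties fall to the answer seen first:

```python
from collections import Counter

def consensus(answers):
    """Aggregate parallel agent outputs and return the majority answer."""
    return Counter(answers).most_common(1)[0][0]

# Three redundant summarizer agents propose candidate answers.
winner = consensus(["Paris", "Paris", "Lyon"])
```

Real aggregators often weight votes by each agent's confidence rather than counting them equally, but the shape of the pattern is the same.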
The following figure illustrates a multi-agent AI decision-making framework where agents vote on proposed solutions, with the orchestrator selecting the consensus-approved output to ensure high-quality results:

The flowchart shows the input sent to the orchestrator and then to three agent bots, which vote; one path is marked with a green tick and another with a red cross. The winning vote determines the output.

Figure 5.16: Multi-agent AI voting system streamlines decision-making for optimal output

Supervisor-subordinate pattern

The supervisor agent monitors and coordinates a group of working agents, stepping in when needed to guide, correct, or optimize their actions.

Structure and behavior: Worker agents operate autonomously | supervisor observes metrics or behaviors | triggers correction if needed.

Design rationale: Maintains high system integrity while allowing autonomous operation at lower levels.

Practical application:

  • Supervisor monitoring chat agents’ performance and customer satisfaction.
  • Retraining or reconfiguring misbehaving agents dynamically.

This figure represents the supervisor-subordinate pattern common in AI multi-agent systems, where a central supervisor (orchestrator) agent receives the user's input, delegates tasks to multiple specialized subordinate agents, and gathers their outputs. The supervisor centrally controls communication, decision-making, and task assignment, ensuring that work progresses efficiently and reliably. Subordinates focus on executing specific tasks and report back to their supervisor, enabling streamlined coordination, monitoring, and recovery if any agent fails:

The flowchart shows the input signal reaching the orchestrator and then an LLM planner agent, which coordinates multiple agents, each producing output.

Figure 5.17: Hierarchical AI workflow

Watchdog or recovery pattern

This resilience-focused pattern introduces a watchdog agent that observes system health and initiates recovery if failures or delays occur.

Structure and behavior: Passive monitoring | detect failure or timeout | rerun, escalate, or switch paths.

Design rationale: Improves robustness, uptime, and system recoverability. Crucial in production-grade systems.

Practical application:

  • Triggering fallback search if the API is down.
  • Reinitializing crashed agents.
  • Failing over to redundant workflows during errors.

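The detect-and-recover behavior can be sketched as a wrapper around any agent call. The retry count, the flaky task, and the fallback below are illustrative assumptions:

```python
def with_watchdog(task, fallback, retries=2):
    """Run task; on failure, retry, then fail over to the fallback path."""
    for attempt in range(retries):
        try:
            return task()      # normal path
        except Exception:
            continue           # watchdog detected a failure; rerun
    return fallback()          # escalate to the redundant workflow

calls = {"n": 0}

def flaky_search():
    # Simulates a primary API that fails on the first call only.
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("primary API down")
    return "primary result"

result = with_watchdog(flaky_search, lambda: "fallback result")
```

A production watchdog would also enforce timeouts and emit alerts, but the rerun-then-failover skeleton stays the same.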
This figure depicts a robust AI orchestration framework where an orchestrator leverages an LLM to delegate incoming tasks to multiple specialized agents. Each agent is paired with a watchdog module, an autonomous monitor ensuring task reliability and quality:

The flowchart shows input entering the orchestrator and then splitting into branches, each sent to a watchdog bot and an agent bot and producing an output. Arrows connect the steps of the flow.

Figure 5.18: AI orchestration with watchdogs

Temporal planner pattern

This variation of the planner-executor pattern incorporates time constraints, scheduling logic, and deadline awareness.

Structure and behavior: Plan includes timestamps or durations | executors run tasks based on schedule | time-based decisions affect flow.

Design rationale: Essential for real-time or delayed execution scenarios. Supports long-horizon planning.

Practical application:

  • Scheduling agent for meetings or events.
  • Automated reporting agent executes every few hours.
  • Time-dependent task prioritization.

The following figure portrays the supervisor-subordinate pattern in AI agent systems, this time emphasizing the temporal dimension: the supervisor issues tasks to subordinate agents across several distinct phases or time steps. At each phase, subordinates execute specific actions and report their outputs; the supervisor assesses progress, updates the strategy, and delegates subsequent tasks based on cumulative results and changing context. This cyclical, time-aware interaction ensures that the system dynamically adapts and coordinates agent efforts throughout multi-stage processes.

The flowchart shows input reaching the orchestrator and then a planner agent, which delegates tasks to two agents, each using different tools. Agent 2 has access to memory, and output flows from the agents back to the user.

Figure 5.19: Supervisor agent orchestration

Having explored the full spectrum of 19 multi-agent system design patterns, from simple sequential chains to complex hierarchical, validator, and consensus-based frameworks, we are now ready to transition from theory to practice. The richness of these patterns is not just academic; it forms the architectural backbone for building intelligent, modular, and scalable GenAI systems.

In this section, we will bring these patterns to life by constructing a real-world, production-ready HITL multi-agent RAG system. This implementation will utilize the StateGraph orchestration capabilities of LangGraph to dynamically route control across agents and tools based on task-specific logic and feedback.

This system will emphasize modularity, extensibility, and full local execution, with no dependency on external APIs or OpenAI services. Instead, we will use the following:

  • A local embedding model (e.g., Nomic).
  • Chroma as the vector store backend.
  • A hybrid retriever combining Best Matching 25 (BM25) and semantic similarity.
  • ReAct-style prompting for transparent reasoning.
  • Integrated PDF parsing to support unstructured knowledge ingestion.
  • Source citation tracking for output reliability.
  • And crucially, a HITL checkpoint for scenarios requiring human judgment, validation, or intervention.

We will architect the system as a multi-agent workflow, using LangGraph's StateGraph to connect agents with different responsibilities: retrieval, grading, generation, and human oversight. Each component will be built in a modular fashion to enable debugging, customization, and reuse.

What follows is not just a demonstration of agentic reasoning, but a blueprint for how real-world GenAI applications can combine autonomy with accountability, reasoning with reliability, and speed with safety. Let us now walk through the architecture, folder structure, and step-by-step implementation of this intelligent system:

The file tree for “rag_hlt_langgraph” shows Python scripts and folders containing components such as embeddings, vector store, retriever, parser, memory, grader, and utilities, along with sample data and descriptive comments.

Figure 5.20: The figure shows the folder structure of a RAG with HITL

Human-in-the-loop

One of the most critical design elements in production-grade AI systems is trust, ensuring that the outputs are accurate, grounded, and contextually appropriate. This is especially important in scenarios like education, healthcare, legal research, or enterprise documentation, where an incorrect or misleading response can have serious consequences. To address this, our system integrates a design pattern known as HITL.

In simple terms, HITL means that the AI does not always operate autonomously. Instead, at specific decision points, such as after generating an answer, the system pauses and asks for human validation. This ensures that a person has the opportunity to approve, reject, or request regeneration of the AI's response before it is finalized or acted upon.

In our implementation, the HITL logic is part of the LangGraph workflow. After the RAG agent produces an answer using hybrid retrieval and local LLM reasoning, the system prints the result along with its sources. It then explicitly calls a function that prompts the human user:

def human_approval_required():
    return input("\nApprove the answer? (yes/no): ").strip().lower() != "yes"

If the user types anything other than yes, the system assumes the answer is unsatisfactory. A retry loop allows up to three regeneration attempts before it halts with a message like answer rejected after multiple attempts.

For new AI practitioners, HITL is an essential mechanism to bridge the gap between AI autonomy and human judgment. It brings responsible AI into action, not just as a buzzword, but as a practical safeguard embedded within the system's architecture.

Let us unpack how HITL is implemented in this system:

  • Explicit invocation: The HITL logic is defined in grader/human_feedback.py and is explicitly invoked in the RAG workflow after the answer is generated. The function prompts the user to approve or reject the response:

    def human_approval_required():
        return input("\nApprove the answer? (yes/no): ").strip().lower() != "yes"

  • Retry loop logic: Within the orchestrator (orchestrator/langgraph_flow.py), the function run_rag_workflow() includes a retry loop that allows the system to attempt regeneration up to three times if the human does not approve the answer:

    retries = 3
    for attempt in range(retries):
        ...
        if not human_approval_required():
            return result["answer"]
        print("\nRetrying with same question...")

  • Graceful fallback handling: If the human rejects all three regenerated answers, the system exits the loop and returns a clear fallback message:

return "Answer rejected after multiple attempts."

Let us understand why this matters:

  • Trust: Ensures human validation in cases where LLM responses may be uncertain or sensitive.
  • Control: Enables the human operator to reject and regenerate outputs as needed.
  • Safety net: Prevents misleading outputs from being used without verification.

This HITL feature introduces an extra layer of control and accountability, making the system more suitable for real-world deployment. It transforms a purely autonomous agentic system into a collaborative workflow, where humans and AI work together to produce reliable results.

Now that we understand the architecture and the role of HITL in ensuring trustworthy AI output, let us explore how this system is implemented in code. The following section walks through the full implementation of each module, step-by-step, highlighting how local embeddings, vector search, hybrid retrieval, ReAct prompting, and LangGraph-based orchestration come together to power an intelligent, controllable, and fully local RAG pipeline.

End-to-end human-in-the-loop RAG workflow

This implementation demonstrates a complete HITL RAG system, orchestrated with LangChain components and designed for full local execution. The system begins by parsing and chunking a local PDF document, then storing those chunks in a persistent vector database (Chroma) using locally generated embeddings.

A hybrid retriever combines BM25 keyword search and vector similarity to identify relevant chunks in response to user queries. The retrieved context is passed to a ReAct-style prompting chain that enables the local language model to reason step-by-step before generating a concise answer.

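The hybrid scoring idea can be illustrated without any retrieval library: fuse a crude keyword-overlap score (standing in for BM25) with a precomputed semantic similarity. The weighting `alpha`, the toy documents, and the similarity scores below are assumptions for illustration, not the book's actual retriever:

```python
def keyword_score(query, doc):
    # Crude lexical overlap standing in for a BM25 score.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_rank(query, docs, semantic_scores, alpha=0.5):
    # Fuse lexical and semantic signals, then sort by the combined score.
    fused = [
        (alpha * keyword_score(query, doc) + (1 - alpha) * sem, doc)
        for doc, sem in zip(docs, semantic_scores)
    ]
    return [doc for _, doc in sorted(fused, key=lambda p: p[0], reverse=True)]

docs = ["vector databases store embeddings", "BM25 ranks by term frequency"]
ranked = hybrid_rank("what are vector databases", docs, semantic_scores=[0.9, 0.2])
```

The fusion weight lets you bias toward exact keyword matches (higher `alpha`) or semantic matches (lower `alpha`) depending on the corpus.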
The system maintains conversational memory, enabling contextual continuity across multiple user inputs. After generating an answer, the system invokes a HITL function that pauses to request user approval. If the response is rejected, the system retries up to three times before gracefully terminating the flow.

This architecture is modular and scalable, making it suitable for enterprise-grade applications where answer accuracy, traceability, and human oversight are essential.

For the complete source code and file structure, refer to the GitHub repository.

From HITL to multi-agent human-in-the-loop RAG

In the previous section, we explored a HITL RAG architecture where the system paused for user validation before finalizing any answer. While this allowed for oversight, the structure was still largely linear and monolithic, with all logic centralized in a single chain.

To truly align with multi-agent design principles, we now decompose the RAG pipeline into modular, interacting agents, each responsible for a specific role. These agents include:

  • A retrieval agent for fetching relevant documents.
  • A generation agent for synthesizing answers using the ReAct prompting technique.
  • A human feedback agent that explicitly asks for approval and loops the system back if the user disapproves.

This design uses LangGraph’s StateGraph to orchestrate the workflow, with clear transitions and conditional routing. Unlike the previous implementation, each agent is isolated in logic but coordinated through the graph, ensuring modularity, reusability, and transparency. Retry logic is also embedded: the generation step can re-execute up to three times if the human does not approve the response.

With these structural changes, we now achieve a true multi-agent HITL RAG system, which is both locally deployable and human-controllable.

Figure 5.21 illustrates a HITL RAG architecture enhanced with agentic components. The process begins with a user query, which is matched against a vector database populated with document embeddings. Documents are first chunked with metadata and embedded using an embedding model. A hybrid retrieval agent fetches relevant chunks based on the query, and a result generation agent synthesizes a response. The response then enters a human feedback loop, where a HITL agent either approves the output or triggers a retry mechanism up to three times. If the response remains unsatisfactory, it is rejected; otherwise, it is returned to the user.

The flowchart shows how a query is processed through the embedding model, the chunked documents with metadata, the vector database, the hybrid retrieval agent, and the result generation agent, with human feedback finally approving, retrying, or rejecting the result.

Figure 5.21: End-to-end HITL RAG workflow with agentic feedback loop

For the complete source code and file structure, refer to Chapter_5_code.ipynb, multi-agent human-in-the-loop.

The retrieval agent is responsible for fetching relevant document chunks based on the user’s question. It uses a hybrid retriever that combines BM25 and vector similarity.

Retrieval agent:

def retrieval_agent(state):
    return {"documents": retriever.get_relevant_documents(state["question"])}

The generation agent synthesizes a response using a ReAct-style prompting chain, reasoning step-by-step over the retrieved context. It also attaches source citations to ensure traceability and provide transparent grounding for the generated answers.

Generation agent:

def generation_agent(state):
    result = rag_chain.invoke({"question": state["question"]})
    return {
        "answer": result["answer"],
        "source_documents": result.get("source_documents", [])
    }

The human feedback loop introduces an approval step into the loop. After each generated answer, it prompts for user validation—allowing humans to approve, reject, or request re-generation—enabling controlled oversight and iterative refinement.

Human feedback loop:

def human_feedback_agent(state):
    approved = not human_approval_required()
    return {"approved": approved}

If the user rejects the answer, the system loops back to the generation agent for a retry, up to a maximum of three attempts. If the user approves, the answer is finalized and returned.

LangGraph’s StateGraph manages the flow across agents: it defines a directed state graph that sequences agent execution from retrieval to generation to validation, dynamically routing based on human feedback, enabling looped retries and graceful exits based on approval logic.

Orchestration using LangGraph:

workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieval_agent)
workflow.add_node("generate", generation_agent)
workflow.add_node("validate", human_feedback_agent)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", "validate")
workflow.add_conditional_edges(
    "validate",
    lambda state: "end" if state.get("approved") else "generate",
    {
        "end": END,
        "generate": "generate"
    }
)

The final graph is compiled and used in the main.py file to handle user input interactively.

This architecture exemplifies the agentic design principles discussed earlier in the chapter (in the section Architecting agentic GenAI systems): each agent is isolated, testable, and extensible, enabling a flexible and robust foundation for intelligent retrieval systems. The integration of human validation ensures that the system not only answers, but answers responsibly.

智能体人工智能与人工智能代理

Agentic AI vs. AI agents

要真正理解本章探讨的架构设计模式,必须区分人工智能代理和更高级的智能体人工智能范式。尽管这两个术语有时可以互换使用,但它们代表了人工智能在能力、自主性和系统协调方面的根本不同层次。

To truly appreciate the architectural design patterns we have explored in this chapter, it is essential to distinguish between AI agents and the more advanced paradigm of agentic AI. Although these terms are sometimes used interchangeably, they represent fundamentally different levels of capability, autonomy, and system coordination in AI.

人工智能代理是自主软件程序,旨在以最少的人工干预执行特定任务。这些系统擅长处理狭窄且定义明确的领域,例如回答客户服务咨询、安排会议或从API检索特定数据。它们的行为通常是被动的,对输入或触发条件做出响应,并且通常遵循线性、单步的执行模式。虽然它们可以使用API​​或数据库等工具,但它们的自主性通常局限于特定边界,无法进行更高层次的规划或协同推理。

AI agents are autonomous software programs designed to perform specific tasks with minimal human intervention. These systems excel in narrow, well-defined domains such as answering customer service queries, scheduling meetings, or retrieving specific data from APIs. Their behavior is typically reactive, responding to input or triggers, and they often follow a linear, single-step execution pattern. While they can use tools like APIs or databases, their autonomy is generally confined to specific boundaries and does not extend to higher-order planning or collaborative reasoning.

相比之下,智能体人工智能指的是一个更为复杂的系统,它由多个人工智能体组成,这些智能体协同工作以解决更高阶的问题。这些系统超越了简单的执行,而是专注于目标设定、高级规划以及跨多个步骤的协调。智能体人工智能体现了多智能体协作、用于情境感知的持久记忆以及基于不断变化的环境进行自适应决策等特征。与独立运行的传统人工智能体不同,智能体人工智能系统以协调网络的形式运行,其中的智能体可以共享信息、委派任务并动态地适应新的目标或环境。

In contrast, agentic AI refers to a more complex system composed of multiple AI agents working collaboratively to solve higher-order problems. These systems go beyond execution and instead focus on goal-setting, advanced planning, and orchestration across multiple steps. Agentic AI embodies characteristics such as multi-agent collaboration, persistent memory for contextual awareness, and adaptive decision-making based on evolving conditions. Unlike traditional AI agents that operate independently, agentic AI systems function as coordinated networks where agents can share information, delegate tasks, and adapt to new goals or contexts dynamically.

One of the key architectural shifts in agentic AI is the movement from isolated task execution to system-level orchestration. Here, a higher-level controller, or an orchestrator agent, coordinates the behavior of specialized agents, enabling the system to decompose complex goals into manageable subtasks. Each specialized agent contributes to a portion of the overall objective, and the orchestrator integrates their outputs to achieve coherent, goal-directed outcomes.
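
This orchestrator pattern can be sketched in plain Python with stub agents; the agent names and the hard-coded decomposition below are illustrative stand-ins, not a specific framework's API:

```python
# Minimal sketch of orchestrator-style coordination: a controller decomposes a
# goal into subtasks and routes each one to a specialized agent. A real
# orchestrator would use an LLM to plan; here the plan is hard-coded.

def research_agent(task: str) -> str:
    # Stand-in for an agent that gathers information for a subtask.
    return f"notes on {task}"

def writing_agent(task: str) -> str:
    # Stand-in for an agent that drafts content from gathered material.
    return f"draft covering {task}"

SPECIALISTS = {"research": research_agent, "write": writing_agent}

def orchestrate(goal: str) -> list[str]:
    # Decompose the goal into (agent, subtask) pairs and delegate each one.
    subtasks = [("research", goal), ("write", goal)]
    results = []
    for agent_name, task in subtasks:
        results.append(SPECIALISTS[agent_name](task))
    # The orchestrator would then integrate these outputs into one coherent,
    # goal-directed result.
    return results

print(orchestrate("quarterly supply chain report"))
```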

Additionally, while AI agents often rely on rule-based or supervised learning tailored to narrow tasks, agentic AI leverages more sophisticated learning strategies such as reinforcement learning, meta-learning, or hybrid approaches that allow for adaptation across broader task domains. This adaptability is crucial in applications like supply chain optimization, virtual project management, and enterprise automation, where static responses are insufficient, and dynamic goal-setting and reasoning are required.

Agentic AI also emphasizes persistent memory, a shared context that enables agents to remember previous interactions, track dependencies, and update strategies over time. This form of memory is not just a technical feature but a strategic enabler that allows agents to build upon one another’s work, minimize redundant processing, and refine their decisions continuously.
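
The role of persistent, shared memory can be illustrated with a toy store; the dict-backed class below merely stands in for a real memory backend such as a database or vector store:

```python
# Sketch of a shared memory that lets agents build on one another's work
# instead of re-deriving context, minimizing redundant processing.

class SharedMemory:
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._store[key] = value

    def read(self, key: str, default: str = "") -> str:
        return self._store.get(key, default)

memory = SharedMemory()
memory.write("customer_tier", "enterprise")  # agent A records a fact

# Agent B later adapts its decision using the shared context.
tier = memory.read("customer_tier")
response = "priority routing" if tier == "enterprise" else "standard routing"
print(response)
```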

In essence, while AI agents are tools, agentic AI is a system of thinkers—autonomous, interactive, and capable of complex planning. As you move forward with building real-world agentic systems, this distinction will guide your architectural choices, helping you select the right tools, coordination mechanisms, and reasoning strategies needed to scale beyond narrow AI tasks toward general-purpose, autonomous workflows.

Conclusion

In this chapter, we explored the foundational principles of architecting agentic GenAI systems, emphasizing how AI agents evolve from reactive executors to collaborative problem-solvers through structured multi-agent coordination. We examined key design patterns, such as sequential, loop, router, and hierarchical, that enable agents to reason, retrieve, act, and adapt in complex workflows. Building on this, we introduced HITL architectures within RAG, showcasing how humans can guide or validate agentic decisions. Finally, we distinguished between traditional AI agents and Agentic AI, highlighting the latter’s focus on multi-step planning, orchestration, and adaptive learning. These concepts lay the groundwork for building dynamic, autonomous systems capable of handling real-world complexity. In the next chapter, we transition from architectural patterns to execution strategies by implementing two-stage GenAI systems enhanced with grading mechanisms, a crucial technique for quality control, response ranking, and robust system evaluation in production-grade applications.

In the next chapter, we will explore interaction mechanisms in dense retrievals and their critical role in two-stage and multi-stage RAG systems. Topics include reranking strategies such as late and full interaction, multi-vector approaches, grading mechanisms, and a practical implementation of a multi-stage RAG workflow with routing and staged reasoning.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 6: Two and Multi-stage GenAI Systems

Introduction

As generative AI (GenAI) systems have become more prevalent in enterprise, research, and consumer applications, the demand for reliable and trustworthy outputs has never been higher. While large language models (LLMs) are capable of generating fluent and contextually appropriate answers, they often suffer from a critical flaw: hallucination. These fabricated or inaccurate outputs can undermine user trust and introduce significant risk in high-stakes domains like healthcare, law, finance, and customer support. This chapter introduces a practical and scalable solution: a two-stage generative pipeline that integrates answer grading and reranking as a validation layer before responses are surfaced to users. By systematically evaluating generated answers using a feedback loop, we shift from passive generation to active quality control, laying the foundation for more dependable GenAI systems.

You will implement this architecture using Python, LangChain, and LangGraph, constructing a modular pipeline consisting of a retriever, generator, and grader. The retriever gathers multiple relevant knowledge contexts, the generator proposes candidate answers, and the grader selects the most accurate or appropriate response using custom evaluation prompts or scoring mechanisms. By the end, you will not only understand the theory behind answer validation but also gain hands-on experience in engineering a GenAI feedback system that is both robust and production-ready.
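
The retrieve, generate, and grade stages can be sketched in plain Python with stub components; the function bodies below are illustrative stand-ins for the chapter's LangChain/LangGraph implementation, with a naive word-overlap grader substituting for an LLM-based evaluation prompt:

```python
# Sketch of the retrieve -> generate -> grade feedback loop. All three
# components are stubs: a real pipeline would query a vector store, call an
# LLM to draft answers, and use an evaluation prompt as the grader.

def retrieve(query: str) -> list[str]:
    # Stand-in for a vector-store lookup returning relevant contexts.
    return ["Paris is the capital of France.", "France is in Western Europe."]

def generate(query: str, context: str) -> str:
    # Stand-in for an LLM call that drafts an answer grounded in one context.
    return f"Based on '{context}', the answer to '{query}' is derived."

def grade(query: str, answer: str, context: str) -> float:
    # Stand-in for an LLM-as-judge prompt; here, naive word overlap with the
    # grounding context approximates faithfulness.
    overlap = set(answer.lower().split()) & set(context.lower().split())
    return len(overlap) / max(len(context.split()), 1)

def answer_with_grading(query: str) -> str:
    candidates = []
    for context in retrieve(query):
        answer = generate(query, context)
        candidates.append((grade(query, answer, context), answer))
    # Surface only the highest-graded candidate to the user.
    return max(candidates)[1]

print(answer_with_grading("What is the capital of France?"))
```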

Structure

In this chapter, we will learn about the following topics:

  • Concepts of interactions in dense retrievals
  • Role of interaction models in two-stage RAG systems
  • Reranking with various interaction models
  • Two-stage RAG architecture
  • Multi-stage RAG
  • Grading mechanisms
  • Implementation of multi-stage RAG workflow with routing

Objectives

This chapter aims to provide a comprehensive understanding of advanced retrieval-augmented generation (RAG) systems, with a focus on the role of dense retrieval interactions and multi-stage processing. It begins by exploring the fundamental concepts of interaction in dense retrieval, followed by an in-depth discussion of two-stage and multi-stage RAG architectures. The chapter then introduces the grading mechanisms used to evaluate retrieval and generation quality. Finally, it presents a practical implementation of a multi-stage RAG workflow with intelligent routing, enabling adaptive query processing using vector search and web search. Readers will gain both conceptual clarity and hands-on insights into building robust RAG systems.

In Chapter 1, Introducing New Age Generative AI, we explored the concepts of bi-encoders and cross-encoders, along with the architecture of a two-stage GenAI system as illustrated in Figure 6.5. Before we get into the two-stage GenAI architecture, let us first examine the different levels of interaction: no interaction, late interaction, and full interaction.

Concepts of interactions in dense retrievals

In Chapter 1, Introducing New Age Generative AI, we introduced the concept of dense retrieval. In dense retrieval systems, the way queries and documents interact during encoding and comparison plays a central role in determining both retrieval performance and computational efficiency. Broadly, interaction mechanisms fall into four categories: no interaction, full interaction, late interaction, and multi-vector representation, each offering unique trade-offs between scalability and semantic matching precision.

No interaction

The bi-encoder architecture represents the most scalable yet coarse-grained approach. Here, the query and the document are encoded independently into fixed-length vector embeddings using separate or shared neural encoders. Once encoded, these vectors are compared using lightweight similarity functions such as cosine similarity or dot product, enabling rapid retrieval using approximate nearest neighbor (ANN) search. This approach is widely used in large-scale systems due to its speed and efficiency. However, because the query and document tokens do not interact during encoding, the semantic alignment is relatively shallow, often missing finer contextual cues. This method is particularly useful for first-stage retrieval, where speed is prioritized over precision.
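
A no-interaction comparison can be sketched with toy, precomputed embeddings; the vectors below are illustrative stand-ins for an encoder's output, and a real system would replace the exhaustive sort with an ANN index:

```python
import numpy as np

# No-interaction (bi-encoder) scoring: query and document embeddings are
# produced independently and compared only with a lightweight similarity
# function such as cosine similarity.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.2, 0.8, 0.1])        # stand-in for encoder(query)
doc_vecs = {
    "doc_a": np.array([0.1, 0.9, 0.0]),      # stand-in for encoder(doc_a), indexed offline
    "doc_b": np.array([0.9, 0.1, 0.2]),
}

# Rank documents by similarity to the query; an ANN index would approximate
# this exhaustive comparison at corpus scale.
ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
print(ranked)
```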

The following figure shows no interaction:

Diagram of two neural networks encoding the document and the query separately into embeddings, with the outputs combined to produce a single similarity or relevance score.

Figure 6.1: No interaction

Full interaction

At the other end of the spectrum lies the cross-encoder, or full interaction model. In this approach, the query and document are concatenated and jointly encoded, allowing every query token to interact with every document token via mechanisms like cross-attention. This setup yields highly expressive representations and precise relevance scores, as the model performs deep semantic reasoning across token pairs. However, the trade-off is substantial: Each document-query pair must be evaluated individually at inference time, making this method prohibitively expensive for retrieval from large corpora. Cross-encoders are often reserved for reranking the top-k candidates retrieved by lighter models like bi-encoders.
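
The per-pair cost profile of full interaction can be illustrated with a stub scorer; the word-overlap function below merely stands in for a real cross-encoder's joint forward pass over the concatenated query-document input:

```python
# Sketch of full-interaction (cross-encoder) reranking. The key point is the
# cost profile: one joint forward pass per query-document pair, which is why
# cross-encoders are reserved for small candidate sets.

def cross_encoder_score(query: str, document: str) -> float:
    # Stand-in for a transformer jointly encoding "query [SEP] document";
    # real models apply cross-attention over all token pairs. Here: naive
    # word-overlap ratio, used purely for illustration.
    q_words, d_words = set(query.lower().split()), set(document.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

query = "effects of caffeine on sleep"
candidates = [  # e.g., top-k output of a fast bi-encoder stage
    "caffeine delays sleep onset and reduces sleep quality",
    "tea plantations in the highlands of Kenya",
]

# Every pair is scored individually at inference time.
reranked = sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)
print(reranked[0])
```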

The following figure shows full interaction:

Diagram of the document and the query processed through neural network layers, with the outputs combined to produce a final score.

Figure 6.2: Full interaction

Late interaction

Late interaction models such as ColBERT, ColPali, and ColQwen offer a practical middle ground. Like bi-encoders, they encode queries and documents independently. However, instead of collapsing representations into a single vector, they retain token-level embeddings. During retrieval, a fine-grained comparison is performed between each query token embedding and all the document token embeddings using operations such as maximum similarity (MaxSim), typically the maximum cosine similarity per query token. The final relevance score is then computed by aggregating these token-level similarities, often using a sum or average of maximum scores across tokens.

This design enables token-aware matching without the compute burden of full attention. Additionally, because document token embeddings can be precomputed and stored (e.g., in a vector database), these models offer an efficient compromise between accuracy and scalability. Notably, recent variants like ColPali (for text-image fusion) and ColQwen (for integrating LLMs like Qwen) further extend late interaction to multimodal and generative contexts, where embeddings from vision-language models (VLMs) or instruction-tuned LLMs are aligned in a shared space for cross-modal retrieval and reranking.
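
The MaxSim aggregation described above can be sketched in a few lines of NumPy; the 2-D token embeddings below are toy values standing in for real encoder output:

```python
import numpy as np

# ColBERT-style MaxSim scoring over token-level embeddings: for each query
# token, take its best match among the document tokens, then sum.

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    # Normalize rows so the dot product equals cosine similarity.
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sims = q @ d.T                         # shape: (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())   # best doc token per query token, summed

query = np.array([[1.0, 0.0], [0.0, 1.0]])               # two query token embeddings
doc_a = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])    # three doc token embeddings
doc_b = np.array([[-1.0, 0.0], [0.0, -1.0]])

# doc_a's tokens align closely with the query tokens; doc_b's point away.
print(maxsim_score(query, doc_a) > maxsim_score(query, doc_b))  # True
```

Because the document token embeddings are computed independently of the query, they can be precomputed and stored, which is what makes this scheme practical at scale.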

The following figure shows late interaction:

Diagram of a query and a document processed through separate neural network layers, with the outputs combined via MaxSim, summed, and scored.

Figure 6.3: Late interaction

The choice between no interaction, late interaction, and full interaction hinges on the application context. No interaction favors speed and indexing; full interaction favors accuracy but scales poorly; late interaction aims for the best of both no interaction and full interaction by preserving rich token-level semantics with practical scalability, making it increasingly popular for dense and multimodal retrieval pipelines in modern AI systems.

Multi-vector representations

Vector representations have become the backbone of modern information retrieval systems, enabling semantic search by embedding documents and queries into high-dimensional continuous spaces. Traditional dense retrieval methods typically represent an entire document as a single vector by pooling token-level embeddings. However, this approach often loses fine-grained semantic information, particularly in the case of long or information-dense texts.

To overcome this limitation, multi-vector representations have been introduced as a mechanism to store and query documents using multiple vectors per entity, often at the token or phrase-level. This design enhances retrieval precision, particularly in scenarios where exact token-level matching is required. Modern vector databases such as Qdrant have introduced native support for multi-vector representations, providing a scalable infrastructure for such fine-grained retrieval mechanisms.

A multi-vector representation is shown in Figure 6.4, which refers to the practice of storing multiple vectors for a single logical unit of data, such as a document or paragraph. Instead of compressing all token-level embeddings into a single pooled representation (as in typical dense retrieval), the multi-vector approach retains multiple embeddings per document. These vectors are often derived from token-level or phrase-level outputs of transformer-based encoders.

Comparison of sentence representations: on the left, a sentence is represented by a single vector; on the right, the same sentence is represented by three vectors in different shades, labeled as multiple vectors.

Figure 6.4: Single vs multi-vector representation

This structure allows more nuanced and context-aware retrieval by enabling query time comparison between query embeddings and the constituent vectors of each document. This is particularly advantageous for tasks such as reranking, where the goal is not just coarse retrieval, but fine-grained scoring based on partial semantic overlaps.

Qdrant offers first-class support for multi-vector representations, allowing each indexed entity to be associated with multiple named vector fields. Each vector field can independently be configured with its own dimensionality, similarity metric (e.g., cosine, dot product), and indexing strategy. A typical configuration involves two vector fields:

  • Dense vector: It is used for first-pass retrieval via ANN search with hierarchical navigable small world (HNSW).
  • Multi-vector field: Used for reranking, storing token-level vectors without HNSW indexing to save memory and computational overhead.

Qdrant enables token-aware reranking through a mechanism known as MaxSim, a similarity comparator that computes the MaxSim between each query vector and the set of document vectors. This strategy closely mirrors the reranking logic used in late interaction models and can be configured through the MultiVectorComparator.MAX_SIM setting.
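
As a sketch of this configuration, the snippet below creates a collection with a dense field for first-pass retrieval and a MaxSim-compared multi-vector field for reranking, using the qdrant-client Python package; the field names and dimensions are illustrative, and the exact API may vary across client versions:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # in-memory instance for experimentation

client.create_collection(
    collection_name="documents",
    vectors_config={
        # Pooled embedding, HNSW-indexed for fast approximate first-pass search.
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE),
        # Token-level embeddings, scored with MaxSim at rerank time.
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            # m=0 disables HNSW graph construction for this field, since it is
            # only used for reranking, not for initial retrieval.
            hnsw_config=models.HnswConfigDiff(m=0),
        ),
    },
)
```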

Naively indexing all token-level vectors in an HNSW graph leads to severe performance bottlenecks:

  • Increased RAM usage due to maintaining large graph structures.
  • Slow insert times because of the complexity of updating the index.
  • Redundant compute, since token-level vectors are typically used only during reranking, not during initial retrieval.

To address this, Qdrant allows HNSW indexing to be disabled selectively for multi-vector fields. This optimization enables fast ingestion and lightweight reranking without sacrificing accuracy, as the initial retrieval step is handled via dense vector fields.

Differentiation from late interaction architectures

While multi-vector representations in databases like Qdrant are inspired by late interaction models, the two concepts differ fundamentally in scope and role.

The table provides a comparative analysis between multi-vector representations as implemented in vector databases like Qdrant and late interaction model architectures such as ColBERT. While both approaches aim to leverage token-level embeddings for improved retrieval accuracy, their scope, implementation, and function differ significantly. Multi-vector representations focus on the infrastructure layer, optimizing how token embeddings are stored, indexed, and used during reranking. In contrast, late interaction models define the embedding generation and matching strategy at the model-level, typically during training and inference. The following table highlights key distinctions in their purpose, usage context, indexing strategies, and system dependencies. This comparison clarifies the complementary relationship between the two and underscores the role of vector databases in scaling late interaction-based retrieval pipelines:

| Dimension | Multi-vector representations (Qdrant) | Late interaction architectures (e.g., ColBERT) |
|---|---|---|
| Definition | Storage and querying mechanism supporting multiple vectors per document. | Model architecture that performs fine-grained similarity at query time. |
| Purpose | Infrastructure optimization for serving token-level representations. | Semantic modeling and retrieval at the embedding level. |
| Similarity function | Uses MaxSim or configurable similarity metrics for reranking. | Typically fixed to MaxSim computed between query and document tokens. |
| Indexing strategy | Allows selective disabling of indexing for multi-vectors. | Indexing is external; document embeddings are stored for matching. |
| Model dependency | Can use any model outputting multiple embeddings. | Requires specific architectural components (e.g., ColBERT transformer layers). |
| Usage context | Infrastructure-level support in production vector databases. | Algorithmic design used during training and inference. |

Table 6.1: Comparison of multi-vector and late interaction

Late interaction refers to the modeling technique that retains token-level embeddings and performs interaction at query time, whereas multi-vector support in Qdrant is a retrieval and storage mechanism that enables deployment of such models in an efficient and scalable manner. Late interaction models generate the embeddings; Qdrant’s multi-vector infrastructure stores and utilizes them efficiently for reranking.

Multi-vector representations enable fine-grained document retrieval by preserving multiple embeddings per entity, a necessity for modern reranking architectures such as ColBERT. Qdrant’s support for multi-vector fields and token-level MaxSim scoring provides a scalable infrastructure for deploying such systems in production environments. While conceptually related to late interaction, multi-vector support operates at the infrastructure-level, complementing model-level innovations by optimizing their deployment and performance characteristics.

Role of interaction models in two-stage RAG systems

In the context of RAG, the type of interaction between query and document representations plays a foundational role in determining both retrieval efficiency and answer accuracy. A two-stage RAG system typically involves an initial retrieval phase to select candidate documents, followed by a reranking phase that refines this list to improve the quality of the generated response. The nature and depth of interaction, whether no interaction, late interaction, or full interaction, directly impact both the architectural design and performance trade-offs of such systems.

Interaction in the retrieval phase

The first-stage of a RAG system generally employs a bi-encoder architecture, also referred to as a no interaction model. In this approach, queries and documents are encoded independently into fixed-length vector representations using separate or shared neural encoders. These embeddings are stored and compared using similarity functions such as cosine similarity or dot product, often accelerated through ANN search. This allows for scalable, low-latency retrieval across large corpora, as only the query needs to be encoded at runtime. However, the lack of cross-token attention during encoding may limit semantic granularity, resulting in lower retrieval precision for complex queries.

Reranking with various interaction models

To address the limitations of the bi-encoder, the second-stage of the RAG pipeline incorporates a reranking mechanism that re-evaluates and prioritizes the top retrieved candidates. This is where models with late interaction or full interaction are particularly relevant, details as follows:

  • Late interaction models, such as ColBERT, ColPali, and ColQwen, retain token-level embeddings for both queries and documents. During reranking, the model computes fine-grained token-to-token similarity scores (e.g., maximum cosine similarity between each query token and all document tokens), enabling a more nuanced assessment of relevance. While documents are still embedded independently and can be pre-indexed, the reranking operation introduces a form of semantic cross-attention that is computationally efficient compared to full interaction approaches.
  • In contrast, full interaction models (cross-encoders) process the query and document jointly by concatenating their token sequences and passing them through a single transformer encoder. This enables full cross-attention, where each query token can attend to every document token, allowing for the highest level of semantic understanding. Although this approach offers the most accurate scoring, it is computationally expensive, and each query-document pair must be evaluated individually at runtime. Consequently, it is feasible only for small candidate sets and is unsuitable for large-scale first-stage retrieval.
  • Reranking with multi-vector representations enhances retrieval precision by enabling token-level interaction between queries and documents, a capability especially important for complex or fine-grained information needs. In traditional dense retrieval, each document is represented by a single pooled vector, which can obscure nuanced semantic signals and reduce discriminative power. In contrast, multi-vector representations retain multiple embeddings per document, typically at the token or phrase-level, preserving local semantic information.

During reranking, an initial retrieval stage selects a shortlist of candidate documents using fast ANN search over dense vectors. Subsequently, each candidate is reranked by comparing its token-level vectors to the token-level query embeddings using similarity measures such as MaxSim. This process captures more precise alignment between specific query terms and relevant document parts, resulting in improved ranking quality.

Importantly, multi-vector-based reranking does not require indexing the individual token-level vectors, which significantly reduces memory overhead and accelerates document ingestion. By decoupling the coarse retrieval phase from the fine-grained scoring phase, multi-vector reranking provides a scalable mechanism for deploying late interaction models in production systems. This hybrid architecture delivers both speed and retrieval accuracy, making it especially suitable for high-performance semantic search and RAG applications.

Integration into two-stage RAG architectures

In practical RAG architectures, bi-encoders are used to retrieve an initial pool of documents (e.g., top 100 candidates) from a vector database. These candidates are then passed through a reranker that employs either late interaction or full interaction models, depending on the desired trade-off between precision and latency. Late interaction models offer a middle ground by supporting scalable, ANN-compatible storage while improving relevance over pure bi-encoder methods. Full interaction models are ideal when precision is paramount, and computational resources permit per-pair processing.
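
Putting the two stages together, the following sketch uses random toy embeddings to stand in for real encoder output: a pooled-vector cosine search produces a shortlist, which is then reranked with token-level MaxSim:

```python
import numpy as np

# End-to-end sketch of the two-stage pattern: stage one retrieves a shortlist
# with pooled vectors; stage two reranks only the shortlist with MaxSim over
# token-level embeddings. All embeddings here are random toy values.

rng = np.random.default_rng(0)
num_docs, dim = 50, 8

doc_pooled = rng.normal(size=(num_docs, dim))                      # stage-one index
doc_tokens = [rng.normal(size=(5, dim)) for _ in range(num_docs)]  # stage-two store

query_pooled = rng.normal(size=dim)
query_tokens = rng.normal(size=(3, dim))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stage one: cosine similarity against pooled vectors, keep top 10 of 50.
scores = normalize(doc_pooled) @ normalize(query_pooled)
shortlist = np.argsort(scores)[::-1][:10]

# Stage two: MaxSim over token embeddings, computed only for the shortlist.
def maxsim(q, d):
    return float((normalize(q) @ normalize(d).T).max(axis=1).sum())

reranked = sorted(shortlist, key=lambda i: maxsim(query_tokens, doc_tokens[i]),
                  reverse=True)
print(reranked[:3])  # final top documents passed as context to the generator
```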

Thus, understanding and selecting the appropriate interaction paradigm is essential for designing effective two-stage RAG systems. By aligning the retrieval and reranking stages with the appropriate encoder architectures, it is possible to achieve an optimal balance between performance, scalability, and accuracy.

The following figure displays the two-stage RAG architecture:

Flowchart showing a query processed by an embedding model, chunked documents stored in a vector database, vector search, reranking of the results, and a large language model generating the final answer returned to the user.

Figure 6.5: Two-stage RAG architecture

Two-stage RAG architecture

RAG systems extend the capabilities of LLMs by retrieving and conditioning on relevant external documents. Instead of relying solely on the model's internal parameters for knowledge retrieval, RAG explicitly fetches documents from an indexed corpus to ground the generative response in external, and often up-to-date, information. When reranking mechanisms are incorporated into this pipeline, the architecture evolves into what is commonly termed a two-stage RAG system.

Stage one: Dense retrieval

The first-stage of this architecture focuses on efficient retrieval from large-scale document corpora. Typically, this is accomplished using dense vector retrieval, where both queries and documents are encoded independently into vector representations. These vectors are indexed in a vector database using ANN search techniques. This allows for scalable and low-latency retrieval, yielding a broad set of semantically relevant documents, often in the range of top fifty to top hundred candidates. However, this stage lacks explicit interaction between the query and the document tokens during encoding. As a result, although the retrieved documents are topically related to the query, their contextual alignment may be shallow, affecting their usefulness in downstream generation tasks.

Stage two: Reranking for semantic precision

The second-stage introduces a reranking component designed to refine the shortlist produced in the first-stage. This is where richer interaction mechanisms are applied between the query and each candidate document. Unlike the initial retrieval, reranking models consider token-level relationships, enabling a deeper semantic alignment. These models may employ token-wise comparisons, attention mechanisms, or partial cross-attention structures that simulate full semantic interaction without incurring the high computational cost of reprocessing the entire corpus. This second pass produces a more accurate relevance score for each document and reorders the shortlist accordingly. The top-ranked documents are then forwarded as the contextual input to the language model for generation.

The strategic role of two-stage design

This bifurcated approach, fast retrieval followed by precise reranking, embodies a strategic trade-off between scalability and accuracy. The first stage ensures rapid exploration across a vast corpus, prioritizing recall and system responsiveness. The second stage ensures that only the most contextually appropriate documents are used to condition the LLM, thereby improving generation quality, factual consistency, and topical relevance. Without the reranking stage, the system risks grounding its output in loosely related or suboptimal documents, which can degrade response quality or introduce hallucinations.

So, reranking is not merely an optimization step but a structural enhancement that defines two distinct yet interdependent stages in the RAG pipeline: retrieval for breadth and reranking for depth. This two-stage configuration ensures alignment between retrieval utility and generation goals, and has become a foundational pattern in modern, high-performance RAG systems, especially those operating in open-domain, enterprise, or high-precision contexts.

Two-stage RAG vs. late interaction

The emergence of late interaction models like ColBERT and ColPali has blurred the lines between traditional two-stage RAG architectures and unified, interaction-rich retrieval systems. These models offer a compelling middle ground between the efficiency of bi-encoders and the precision of cross-encoders. The key question is: if ColBERT or ColPali is used as the retriever, is a separate reranking stage still necessary?

Capabilities of ColBERT and ColPali

Unlike standard dense retrievers (e.g., dual-encoder architectures), ColBERT-type models do not compress documents and queries into a single vector. Instead, they preserve token-level embeddings, which are compared during retrieval using operations like MaxSim between query and document tokens. This preserves fine-grained semantic information while still enabling ANN search via indexing of token vectors (e.g., using late interaction indexing schemes).

As a result, ColBERT and ColPali already perform a sophisticated form of reranking at retrieval time. The scoring function considers multiple token interactions, offering far more semantic alignment than traditional bi-encoder retrieval, but without the full computational cost of cross-encoders. In effect, this late interaction serves as an implicit reranking mechanism, making the retrieval stage more precise.
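
The MaxSim operation itself is small enough to show directly. In this hedged sketch the token embeddings are hand-built unit vectors; ColBERT-style models would produce them with a transformer encoder, but the scoring rule, summing each query token's best-matching document token similarity, is the same.

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction score. query_tokens (q, d) and doc_tokens (n, d)
    are L2-normalized token embeddings. Each query token takes its best
    cosine match among the document tokens; the maxima are summed."""
    sims = query_tokens @ doc_tokens.T       # (q, n) token-pair similarities
    return float(sims.max(axis=1).sum())     # best document token per query token

# Two orthogonal "concepts" serve as toy token embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])   # two query tokens
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])   # covers both concepts
doc_b = np.array([[0.0, 1.0]])               # covers only one concept
print(maxsim(query, doc_a), maxsim(query, doc_b))  # doc_a scores higher
```

Unlike a single-vector dot product, this score rewards documents that cover every query token, which is why late interaction behaves like a lightweight built-in reranker.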

Use of two-stage RAG

However, despite their improved expressiveness, ColBERT-style models may still benefit from a second stage, that is, a two-stage RAG pipeline with a dedicated reranker, in high-stakes or highly nuanced applications. The reasons include:

  • Better ranking fidelity: Full interaction models (cross-encoders) still outperform late interaction models in certain benchmarks because they model global token dependencies, not just MaxSim heuristics.
  • Generation alignment: Late interaction scores optimize retrieval ranking, not necessarily the quality of downstream generation. A second-stage reranker can better align retrieval scores with generation utility.
  • Ensemble robustness: A two-stage pipeline allows the combination of different scoring signals (e.g., late interaction + generative loss + factuality scoring).

In practical systems (e.g., enterprise RAG, long-context retrieval, hybrid multimodal setups), it is not uncommon to use ColBERT for high-recall shortlist generation and follow it with a cross-encoder reranker that focuses only on the top-k candidates.

Multi-stage RAG

RAG systems typically combine retrieval and generation to provide accurate and contextually relevant responses by supplementing generative language models with external knowledge. While commonly described as a two-stage process of retrieving relevant documents and then generating answers from the retrieved content, advanced RAG implementations often involve multiple retrieval, filtering, reranking, and validation stages, collectively referred to as multi-stage RAG.

Beyond two-stage systems

Standard RAG implementations include:

  • Retrieval stage: Relevant documents or passages are retrieved from a knowledge base using dense embeddings and vector similarity.
  • Generation stage: A generative model synthesizes retrieved information to provide coherent, contextually appropriate responses.

However, real-world applications often demand additional intermediary stages to enhance performance and accuracy. These stages address challenges such as ambiguity in queries, redundancy in retrieved results, and the potential for generative hallucination.

Components of multi-stage RAG

A multi-stage RAG architecture typically integrates additional steps such as:

  • Query expansion and refinement: Prior to retrieval, queries can be expanded or refined using pre-trained language models or auxiliary knowledge bases to enhance retrieval accuracy.
  • Multimodal retrieval: Combining textual embeddings with image, video, or audio embeddings enables richer semantic retrieval, providing context from diverse modalities.
  • Hybrid retrieval: Incorporating multiple retrieval mechanisms, such as sparse keyword-based retrieval (e.g., Best Matching 25 (BM25)) alongside dense semantic retrieval ensures comprehensive coverage and robustness.
  • Reranking stage: Retrieved documents or passages are reranked based on more computationally intensive cross-encoder models that consider fine-grained interactions between queries and documents, significantly improving the relevance of retrieved items.
  • Validation and fact-checking: Implementing a dedicated validation step, often using specialized models or rule-based heuristics, reduces inaccuracies by verifying facts or filtering unreliable sources before final generation.
  • Iterative feedback and refinement: Utilizing intermediate feedback loops, where preliminary outputs from generative models inform additional retrieval or refinement, enables dynamic, iterative improvement of responses.
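
Of the stages above, hybrid retrieval is easy to make concrete. One common way to merge a sparse (e.g., BM25) ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already produced an ordered list of document ids; k=60 is the conventional smoothing constant.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several rankings (best first) with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each appearance contributes 1 / (k + position); documents
            # ranked well by several retrievers accumulate the most score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["d1", "d2", "d3"]   # e.g., from BM25
dense_ranking = ["d2", "d3", "d1"]    # e.g., from a vector index
print(rrf_fuse([sparse_ranking, dense_ranking]))
```

RRF needs only rank positions, not raw scores, so it sidesteps the problem of calibrating BM25 scores against cosine similarities.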

Benefits of multi-stage RAG

Incorporating multiple stages into RAG architectures can offer several advantages:

  • Improved accuracy and relevance: Each added stage incrementally refines the retrieval and generation processes, significantly enhancing accuracy and reducing irrelevant or redundant information.
  • Reduction of hallucinations: Additional validation stages systematically identify and eliminate factual inaccuracies, thus increasing the reliability and credibility of generated content.
  • Enhanced contextual understanding: Multimodal and hybrid retrieval allow for richer, nuanced contextual interpretation, especially valuable in complex domains requiring integration of textual and visual information.

Types of multi-stage RAG

From the preceding section, you must have understood that RAG systems enhance language models by incorporating external knowledge sources through a combination of retrieval and generation components. While the canonical RAG framework typically involves two stages, that is, retrieval followed by generation, emerging research and practical implementations demonstrate the efficacy of extending RAG into multi-stage architectures. These advanced configurations address limitations of simple two-stage systems and enable the handling of complex tasks with higher accuracy, contextual understanding, and adaptability.

The following list outlines different types of multi-stage RAGs:

  • Simple RAG: It represents the foundational architecture wherein the relevant documents are retrieved from a knowledge base using dense vector representations and are subsequently used by a language model to generate a response. This technique is well-suited for straightforward question answering scenarios that benefit from supplemental context but do not require ongoing dialogue or complex reasoning.
  • Simple RAG with memory: This variant enhances simple RAG by incorporating memory mechanisms to retain context across multiple interactions. Particularly valuable in conversational AI, simple RAG with memory maintains continuity between queries, enabling the system to resolve co-references and follow-up questions effectively. The model leverages prior conversational turns as part of the retrieval or generation process to ensure contextual coherence.
  • Branched RAG: It introduces multiple retrieval steps, where intermediate outputs inform subsequent retrievals. This structure is instrumental for complex queries that require multi-hop reasoning or synthesis across various information sources. The iterative retrieval mechanism progressively narrows down relevant content, improving the specificity and depth of the final response.
  • Hypothetical Document Embedding (HyDE): HyDE modifies the retrieval process by generating a hypothetical ideal document that would likely contain the answer to a given query. This generated document is then embedded and used as a query against the knowledge base. HyDE proves beneficial in cases where the knowledge base may not contain exact matches, such as abstract or underspecified queries, by guiding retrieval toward semantically aligned content.
  • Adaptive RAG: It dynamically adjusts its retrieval and generation strategies based on the complexity or type of input query. It may choose between sparse or dense retrieval techniques and select different generation models or configurations accordingly. This adaptability enables the system to accommodate a wide spectrum of user intents and data domains, offering a more robust and flexible interface.
  • Corrective RAG (CRAG): CRAG introduces a validation loop wherein generated outputs are cross-verified against retrieved content. If discrepancies are detected, the generation is refined through additional retrieval or corrective generation steps. This approach is particularly useful in high-stakes applications such as legal or medical decision-making, where factual accuracy and verifiability are critical.
  • Self-RAG: It extends the generation pipeline with self-reflection capabilities. After generating an initial response, the model assesses its own output, identifies potential weaknesses, and performs further retrieval to improve the response. This self-improvement cycle aligns with recent developments in reflective and self-consistent reasoning, enhancing overall output quality and depth.
  • Agentic RAG: It integrates RAG with autonomous agent behavior. The system can plan, reason, and use external tools such as application programming interfaces (APIs), calculators, or databases to execute complex multi-step tasks. This paradigm is well-suited for real-world problem-solving in domains requiring decision-making, workflow orchestration, or interactive tool usage.
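
Among the variants above, HyDE's retrieval-by-hypothetical-document step can be sketched in a few lines. The generate_hypothetical stub below stands in for an LLM call, and retrieve_fn is assumed to be any embedding-based retriever; the essential point is that the retriever is queried with the generated passage rather than the raw question.

```python
def generate_hypothetical(query: str) -> str:
    # Stub for an LLM call, e.g.:
    #   llm.invoke(f"Write a short passage that answers: {query}")
    return f"A short passage that answers the question: {query}"

def hyde_retrieve(query: str, retrieve_fn) -> list[str]:
    """Retrieve using a hypothetical answer document instead of the query."""
    hypothetical_doc = generate_hypothetical(query)
    return retrieve_fn(hypothetical_doc)  # embed the passage, not the query
```

Because the hypothetical passage is phrased like an answer, its embedding tends to land closer to answer-bearing documents than the often terse query does.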

The evolution from simple to multi-stage RAG architectures reflects a broader trend toward more adaptive, intelligent, and context-aware AI systems. Each RAG variant addresses specific challenges in information retrieval and natural language generation (NLG), offering tailored solutions for diverse applications ranging from casual conversation to high-stakes analytical reasoning. As RAG continues to mature, hybrid and multi-agent configurations are likely to play an increasingly prominent role in knowledge-intensive AI workflows.

Grading mechanisms

Grading mechanisms are integral in certain RAG variants to evaluate, rank, or select among multiple candidate responses, retrievals, or reasoning paths. This evaluative layer ensures that only the most contextually appropriate and factually accurate outputs are surfaced to the user. Different RAG grading mechanisms are as follows:

  • Agentic RAG: These systems particularly benefit from grading. Given their autonomy in decision-making, tool invocation, and multi-step planning, these systems often generate multiple hypotheses, action plans, or intermediate outputs. A grading module, often based on a learned scoring function or prompt-based evaluator, is then employed to assess these options based on coherence, factual consistency, relevance, or task-specific criteria. This process helps in selecting the optimal plan or response before final generation or tool execution.
  • Self-RAG: It may also incorporate grading during the self-reflection phase. After an initial generation, the model may produce alternative versions or corrections. A grader assesses these variations to determine which revision most accurately addresses the user’s query, reducing hallucinations and increasing output fidelity.
  • Adaptive RAG: Grading plays a supportive yet increasingly important role in adaptive RAG systems. Given that adaptive RAG dynamically selects retrieval and generation strategies based on query characteristics, such as complexity, domain, or ambiguity, a grading mechanism can serve as a meta-controller that scores or ranks candidate strategies. This allows the system to evaluate which combination of retrieval method (e.g., sparse vs. dense), retriever depth, or generation configuration yields the most relevant and reliable output for a given context. In practical implementations, this grading may rely on confidence scoring, retrieval quality heuristics, or even LLM-based evaluators. By incorporating a lightweight grading function, adaptive RAG ensures that strategy selection is not just reactive but evidence-based, thus enhancing both precision and generalization across heterogeneous query types.
  • CRAG: Grading is intrinsic to the process of iterative refinement. After an initial generation is produced, the system evaluates it against retrieved documents to identify inconsistencies, hallucinations, or factual mismatches. This evaluative step functions as a grading mechanism, either rule-based (e.g., keyword matching, entailment checks) or model-based (e.g., a separate LLM scoring factual consistency). The output is assigned a relevance or accuracy score, and if it falls below a defined threshold, the generation is corrected through additional retrieval, rephrasing, or selective regeneration. Thus, grading enables the model to operationalize feedback loops that uphold factual integrity, making it especially suitable for domains where the cost of misinformation is high.
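
A corrective loop of the kind CRAG describes reduces to a grade-and-retry pattern. In the hedged sketch below, generate_fn and grade_fn stand in for LLM-based generation and grading, and retrieve_fn for the retriever; when the grade falls below a threshold, the system performs additional retrieval and regenerates.

```python
def corrective_generate(query, retrieve_fn, generate_fn, grade_fn,
                        threshold=0.7, max_rounds=3):
    """CRAG-style loop: regenerate with extra evidence until the grade passes."""
    docs = retrieve_fn(query)
    answer = None
    for round_idx in range(1, max_rounds + 1):
        answer = generate_fn(query, docs)
        if grade_fn(answer, docs) >= threshold:
            return answer, round_idx              # accepted
        # Corrective retrieval: broaden the evidence using the draft answer.
        docs = docs + retrieve_fn(query + " " + answer)
    return answer, max_rounds                     # best effort after retries

# Toy components: grading passes once at least two documents are gathered.
retrieve = lambda q: ["evidence"]
generate = lambda q, d: f"answer based on {len(d)} documents"
grade = lambda a, d: 1.0 if len(d) >= 2 else 0.0
print(corrective_generate("q", retrieve, generate, grade))
```

The max_rounds cap matters in practice: without it, a query the corpus simply cannot answer would loop indefinitely.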

In essence, grading transforms RAG from a purely generative process into a more deliberative and evaluative pipeline, aligning with recent trends in reflective and multi-agent AI systems.

Challenges and considerations

Despite their advantages, multi-stage RAG systems introduce complexity and computational overhead. Critical considerations include balancing performance gains against latency increases, managing resource allocation efficiently, and optimizing inter-stage communication to avoid bottlenecks.

Multi-stage RAG architectures represent a significant advancement over traditional two-stage models. By strategically incorporating additional retrieval, refinement, and validation steps, these sophisticated systems are better suited for high-stakes, real-world applications where accuracy, reliability, and contextual comprehension are paramount.

Token utilization in multi-stage RAG systems

Token utilization is a critical consideration in the design and deployment of multi-stage RAG systems. Each stage of retrieval, reranking, validation, and generation consumes a portion of the available token budget, which is constrained by the context window of the underlying language model. Efficient token budgeting directly impacts both the fidelity of responses and the cost-effectiveness of system deployment.

In multi-stage pipelines, token usage typically escalates due to:

  • Expanded input contexts: Intermediate stages, such as memory augmentation, branched retrieval, and multi-hop queries, increase the number of documents or prompts passed into the generator.
  • Intermediate summarization or scoring: Self-RAG, CRAG, and agentic RAG often perform additional passes where candidate outputs are scored or re-encoded, requiring further token expenditure.
  • Long-form generation: Multimodal inputs or multi-agent plans can lead to longer generated outputs, adding to downstream token consumption.

Token allocation must therefore be strategically managed. Some techniques include selective truncation of low-ranking documents, compression via summarization models, or tiered ranking systems that minimize token-intensive steps unless necessary. Advanced configurations use routing or grading mechanisms to determine which branches of the pipeline warrant deeper token investment.
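
The selective-truncation idea can be made concrete as a greedy packing routine. Counting whitespace-separated words is only a hedged proxy here; a real pipeline would count with the target model's tokenizer (e.g., tiktoken for OpenAI models).

```python
def pack_context(ranked_docs: list[str], budget: int) -> list[str]:
    """Keep documents in ranked order while the token budget allows."""
    packed, used = [], 0
    for doc in ranked_docs:
        cost = len(doc.split())     # proxy for a real tokenizer count
        if used + cost > budget:
            continue                # skip docs that would overflow the budget
        packed.append(doc)
        used += cost
    return packed

ranked = ["alpha beta gamma", "delta epsilon zeta eta", "theta"]
print(pack_context(ranked, budget=5))
```

Skipping (rather than stopping at) an oversized document lets smaller lower-ranked documents still fill the remaining budget; a stricter variant would summarize or truncate the oversized document instead.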

Ultimately, token optimization in multi-stage RAG systems is essential not only for computational efficiency but also for preserving model accuracy within the token constraints. Thoughtful management of token flow enables the design of scalable, high-precision RAG architectures suitable for enterprise deployment.

Grading types

In RAG pipelines, a second stage remains essential even when initial retrieval appears sufficient. This is particularly true in scenarios where retrieved documents may be only partially relevant, contain noisy or ambiguous content, or require additional reasoning to determine their usefulness. The second stage introduces refinement, filtering, or validation mechanisms, typically powered by LLM-based graders, that help ensure only contextually aligned documents are passed to the generation module.

The following outlines a common grading component used in such second-stage setups:

  • Retrieval relevance grader:
    • Purpose and role: The retrieval relevance grader is the first layer of validation in a multi-stage RAG pipeline. Its primary function is to evaluate whether a retrieved document from a knowledge base is semantically relevant to a given user query. This relevance check is foundational, as it determines whether the downstream generative model will be grounded in a pertinent context.
    • Methodology: The grader is formulated as an LLM-based binary classifier, prompted with both the user question and a retrieved document. It assesses the document for lexical and semantic alignment with the query. If the document contains terms, concepts, or information that pertain to the question, it is considered relevant.
    • Prompt design: The prompt provides a structured input that includes:
      • The full content of the retrieved document.
      • The original user queries.
      • Instructions to objectively determine whether the document contains at least some relevant information.

      The LLM is expected to return a JSON output with a single key:

      • binary_score, whose value is either "yes" (relevant) or "no" (not relevant).
    • Evaluation format:

      {
        "binary_score": "yes"
      }

    • Code:

      ### Retrieval Grader

      import json

      from langchain_core.messages import HumanMessage, SystemMessage

      # retriever (a vector store retriever) and llm_json_mode (an LLM
      # configured to emit JSON) are assumed to be defined earlier.

      # Doc grader instructions
      doc_grader_instructions = """You are a grader assessing relevance of a retrieved document to a user question.

      If the document contains keyword(s) or semantic meaning related to the question, grade it as relevant."""

      # Grader prompt
      doc_grader_prompt = """Here is the retrieved document: \n\n {document} \n\n Here is the user question: \n\n {question}.

      Carefully and objectively assess whether the document contains at least some information that is relevant to the question.

      Return JSON with single key, binary_score, that is 'yes' or 'no' score to indicate whether the document contains at least some information that is relevant to the question."""

      # Test
      question = "What is Chain of thought prompting?"
      docs = retriever.invoke(question)
      doc_txt = docs[1].page_content

      doc_grader_prompt_formatted = doc_grader_prompt.format(
          document=doc_txt, question=question
      )

      result = llm_json_mode.invoke(
          [SystemMessage(content=doc_grader_instructions)]
          + [HumanMessage(content=doc_grader_prompt_formatted)]
      )

      json.loads(result.content)

  • Hallucination detection grader:
    • Purpose and role: The hallucination detection grader serves a critical verification function by determining whether a generated answer is factually grounded in the source documents retrieved during the RAG process. Hallucination in GenAI refers to fabricated or unsupported information not present in the evidence set. This grader aims to filter such artifacts before the final presentation to the user.
    • Methodology: The grader is prompted with two inputs:
      • A corpus of factual documents retrieved earlier in the pipeline
      • A student-generated answer produced by the language model

      The LLM is instructed to verify whether the answer adheres strictly to the content in the provided documents, without introducing external information. The process emphasizes explanation-driven grading, requiring a reasoned judgment rather than a simple binary decision.

    • Prompt design: The prompt includes:
      • A labeled section for FACTS (retrieved documents)
      • A labeled section for STUDENT ANSWER (model generation)
      • Clear criteria for judging grounding and hallucination

      The expected output is a JSON object containing:

      • binary_score: "yes" (fully grounded) or "no" (contains hallucination)
      • explanation: a step-by-step justification for the grading decision
    • Evaluation format:

      {
        "binary_score": "no",
        "explanation": "The answer introduces models not found in the document..."
      }

    • Code:

      ### Hallucination Grader

      # Hallucination grader instructions
      hallucination_grader_instructions = """
      You are a teacher grading a quiz.

      You will be given FACTS and a STUDENT ANSWER.

      Here is the grade criteria to follow:

      Ensure the STUDENT ANSWER is grounded in the FACTS.

      Ensure the STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

      Score:

      A score of yes means that the student's answer meets all of the criteria. This is the highest (best) score.

      A score of no means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.

      Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct.

      Avoid simply stating the correct answer at the outset."""

      # Grader prompt
      hallucination_grader_prompt = """FACTS: \n\n {documents} \n\n STUDENT ANSWER: {generation}.

      Return JSON with two keys, binary_score is 'yes' or 'no' score to indicate whether the STUDENT ANSWER is grounded in the FACTS. And a key, explanation, that contains an explanation of the score."""

      # Test using documents (docs_txt) and generation from above
      hallucination_grader_prompt_formatted = hallucination_grader_prompt.format(
          documents=docs_txt, generation=generation.content
      )

      result = llm_json_mode.invoke(
          [SystemMessage(content=hallucination_grader_instructions)]
          + [HumanMessage(content=hallucination_grader_prompt_formatted)]
      )

      json.loads(result.content)

  • Answer quality grader:
    • Purpose and role: The answer quality grader evaluates whether the generative response meaningfully addresses the user’s original question. Unlike the hallucination grader, which focuses on factual alignment, this grader assesses semantic utility, i.e., whether the answer contributes to resolving the user’s intent.
    • Methodology: This grader compares the user’s question against the model’s answer to determine if the response is sufficient, informative, and contextually appropriate. The criteria allow for extra information in the response, provided it helps answer the question.
    • Prompt design: The grader prompt contains:
      • The original QUESTION
      • The STUDENT ANSWER
      • Instructions that highlight both alignment and completeness

      The output is a structured JSON object with:

      • binary_score: "yes" if the answer helps answer the question, "no" otherwise
      • explanation: a detailed rationale supporting the score

      This grader helps differentiate between vague, uninformative answers and those that provide substantial, relevant insights.

    • Evaluation format:

      {

      "binary_score": "yes",

      "explanation": "The answer clearly states the Llama 3.2 vision models and relates them to the question."

      }

    • Code:

      ### Answer Grader

      # Answer grader instructions

      answer_grader_instructions = """You are a teacher grading a quiz.

      You will be given a QUESTION and a STUDENT ANSWER.

      Here is the grade criteria to follow:

      (1) The STUDENT ANSWER helps to answer the QUESTION

      Score:

      A score of yes means that the student's answer meets all of the criteria. This is the highest (best) score.

      The student can receive a score of yes if the answer contains extra information that is not explicitly asked for in the question.

      A score of no means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.

      Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct.

      Avoid simply stating the correct answer at the outset."""

      # Grader prompt

      answer_grader_prompt = """QUESTION: \n\n {question} \n\n STUDENT ANSWER: {generation}.

      Return JSON with two keys, binary_score is 'yes' or 'no' score to indicate whether the STUDENT ANSWER meets the criteria. And a key, explanation, that contains an explanation of the score."""

      # Test

      question = "What are the vision models released today as part of Llama 3.2?"

      answer = "The Llama 3.2 models released today include two vision models: Llama 3.2 11B Vision Instruct and Llama 3.2 90B Vision Instruct, which are available on Azure AI Model Catalog via managed compute. These models are part of Meta's first foray into multimodal AI and rival closed models like Anthropic's Claude 3 Haiku and OpenAI's GPT-4o mini in visual reasoning. They replace the older text-only Llama 3.1 models."

      # Test using question and generation from above

      answer_grader_prompt_formatted = answer_grader_prompt.format(

      question=question, generation=answer

      )

      result = llm_json_mode.invoke(

      [SystemMessage(content=answer_grader_instructions)]

      + [HumanMessage(content=answer_grader_prompt_formatted)]

      )

      json.loads(result.content)

您可以根据 RAG 流程的阶段和评估目标(例如,正确性、事实性、相关性、流畅性)设计多种类型的评分器。请参考下表,其中描述了各种评分器类型、其用途、适用场景和示例:

You can design multiple types of graders depending on the stage of your RAG pipeline and the evaluation goals (e.g., correctness, factuality, relevance, fluency). Refer to the following table, as it describes comprehensive grader types, their purpose, when to use them, and examples:

评分器类型

Grader type

目的

Purpose

何时使用

When to use

示例用例

Example use case

检索相关性评分器

Retrieval relevance grader

检查检索到的文档是否与查询相关。

Check if the retrieved document is relevant to the query.

检索之后,生成之前。

After retrieval, before generation.

确保只有与“什么是迁移学习?”相关的文档才会传递给 LLM。

Ensure only documents relevant to "What is transfer learning?" are passed to the LLM.

幻觉评分器

Hallucination grader

检查生成的答案是否基于检索到的事实。

Check if the generated answer is grounded in retrieved facts.

生成之后,最终输出之前。

After generation, before the final output.

验证 LLM 生成的关于 Llama 3.2 的答案是否与检索到的文档相符。

Verify if the LLM-generated answer about Llama 3.2 matches the retrieved documents.

答案质量评分器

Answer quality grader

判断生成的答案是否有效地回答了问题。

Judge whether the generated answer meaningfully addresses the query.

显示前的最终评估阶段。

Final evaluation stage before display.

判断答案是否正确且有效地解释了梯度下降的工作原理。

Decide if the answer explains "How does gradient descent work?" correctly and helpfully.

忠实性评分器

Faithfulness grader

与幻觉评分器类似,但侧重于逻辑一致性。

Similar to the hallucination grader, but focused on logical alignment.

当LLM可能推断出未明确说明的结论时使用。

Use when the LLM may infer unstated conclusions.

检查有关因果推断的答案背后的推理是否有来源支持。

Check if the reasoning behind an answer about causal inference is backed by sources.

完整性评分器

Completeness grader

判断答案是否涵盖了问题的所有必要子部分。

Judge if the answer covers all required sub-parts of the question.

对于多部分或复合问题。

For multi-part or compound questions.

评估“GPT 模型的优势和风险是什么?”这一问题的答案。

Evaluate the answer to "What are the benefits and risks of GPT models?"

连贯性评分器

Coherence grader

检查答案在逻辑和语法上是否正确。

Checks whether the answer is logically and grammatically well-formed.

当你想确保可读性和清晰度时。

When you want to ensure readability and clarity.

确保长篇答案逻辑清晰、条理分明,没有矛盾或漏洞。

Ensure that a long-form answer flows well and has no contradictions or gaps.

毒性或偏见评分器

Toxicity or bias grader

检测有害、有偏见或不恰当的内容。

Detect harmful, biased, or inappropriate content.

出于安全考虑,在部署或展示之前。

For safety, before deployment or display.

过滤掉答案中有关种族、性别和政治的偏见性言论。

Filter out biased statements in answers about race, gender, and politics.

简洁性评分器

Conciseness grader

确保答案不冗长、不偏题。

Ensure the answer is not verbose or off-topic.

当您需要简短的答案时(例如,用于摘要或移动设备)。

When you want short answers (e.g., for summaries or mobile use).

将关于量子计算的答案精简到50字以内。

Trim the answer about quantum computing to fit within 50 words.

一致性评分器

Consistency grader

检查类似问题的答案是否一致。

Check if answers to similar questions are consistent.

用于评估多轮或批量输出。

For evaluating multi-turn or batch outputs.

确保“什么是人工智能?”与“人工智能的定义”这两个问题的答案保持一致。

Ensure that answers to "What is AI?" and "the definition of AI" are aligned.

遵循指示的评分器

Instruction-following grader

评估对具体指示或限制的遵守情况。

Evaluate adherence to specific instructions or constraints.

当提示包含自定义说明时(例如,只列出三点)。

When prompts contain custom instructions (e.g., list three points only).

检查 LLM 是否遵循说明,例如使用项目符号或避免使用数学符号。

Check if the LLM follows instructions like using bullet points or avoiding math symbols.

证据归因评分器

Evidence attribution grader

检查答案的来源是否正确引用。

Check if the source of an answer is cited properly.

适用于知识密集型问答或学术应用。

For knowledge-intensive QA or academic applications.

确保关于研究论文的答案包含类似(Smith et al. ,2022)的引用。

Ensure the answer about a research paper includes a citation like (Smith et al., 2022).

表 6.2:RAG 管道中的评分器类型

Table 6.2: Types of graders in RAG pipelines
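Every grader in Table 6.2 can be assembled from the same three pieces used in the code above: an instruction string, a format-string prompt, and a JSON verdict. As a minimal sketch, here is how a completeness grader (one of the rows in the table) might be defined; the instruction wording is illustrative, and invoking it would reuse the `llm_json_mode` model from earlier in the chapter.

```python
# Sketch of one more grader from Table 6.2 -- a completeness grader -- built
# from the same instruction/prompt/JSON pattern as the graders above. Only the
# prompt construction is shown; invoking the LLM would follow the earlier code.

completeness_grader_instructions = """You are a teacher grading a quiz.

You will be given a QUESTION and a STUDENT ANSWER.

Here is the grade criteria to follow:

(1) The STUDENT ANSWER addresses every sub-part of the QUESTION.

A score of yes means all sub-parts are covered. A score of no means at least
one sub-part is missing.

Explain your reasoning in a step-by-step manner before giving the score."""

completeness_grader_prompt = """QUESTION: \n\n {question} \n\n STUDENT ANSWER: {generation}.

Return JSON with two keys, binary_score ('yes' or 'no') and explanation."""

# Formatting works exactly as for the other graders:
prompt = completeness_grader_prompt.format(
    question="What are the benefits and risks of GPT models?",
    generation="GPT models speed up drafting, but they can hallucinate facts.",
)
print(prompt)
```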

实现多阶段 RAG 工作流程及路由

Implementation of multi-stage RAG workflow with routing

Chapter_6.ipynb包含多个代码实现,这些实现使用了 LangChain、LangGraph 和通过 Ollama 集成的本地大语言模型(LLM,例如 Llama 3.2),构建了多种多阶段、检索增强型问答系统。该系统可以从预嵌入的本地向量存储或通过实时网络搜索检索信息,智能地路由查询,生成答案,然后对输出结果的质量和相关性进行评分。

Chapter_6.ipynb contains multiple code implementations of multi-stage, retrieval-augmented question answering systems using LangChain, LangGraph, and Ollama-integrated local LLMs (e.g., Llama 3.2). The system retrieves information either from a pre-embedded local vector store or via a live web search, routes queries intelligently, performs generation, and then grades the output for quality and relevance.

它首先使用NomicEmbeddings嵌入特定领域的文档(例如,关于智能体和对抗攻击的博客文章) ,并将它们存储在SKLearnVectorStore中。问题通过 JSON 模式的路由模型进行路由,该模型决定是使用向量存储还是通过 Tavily API 进行网络搜索。

It begins by embedding domain-specific documents (e.g., blog posts on agents and adversarial attacks) using NomicEmbeddings and storing them in an SKLearnVectorStore. Questions are routed through a JSON-mode router model that determines whether to use the vector store or web search via the Tavily API.

文档检索完成后,检索评分器会检查其与查询的相关性。如果文档相关,系统会调用基于提示的检索增强生成(RAG)生成器。生成的答案会通过幻觉评分器(确保其有据可依)和答案质量评分器(评估其完整性)进行验证。

Once documents are retrieved, a retrieval grader checks relevance to the query. If the documents are relevant, the system invokes a prompt-based RAG generator. Generated answers are validated with a hallucination grader (to ensure grounding) and an answer quality grader (to assess completeness).

整个系统通过 LangGraph 状态机进行协调,从而实现路由、评分、生成和引用等环节的条件流程。这种设计确保了响应的自适应合成,它利用静态知识和实时网络数据,并集成了质量控制机制,以保证可靠性和可信度。

The whole system is orchestrated via a LangGraph state machine, allowing conditional flow through routing, grading, generation, and citation. This design ensures adaptive response synthesis, using both static knowledge and live web data, with integrated quality control mechanisms for reliability and trustworthiness.
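To make the orchestration concrete, the following self-contained sketch replays the same control flow in plain Python, with every node stubbed out. The stub bodies (router, retriever, graders, generator) are placeholders, not the notebook's real LangChain components; only the routing, fallback, retry, and exit logic mirrors the LangGraph design described above.

```python
# Plain-Python sketch of the routed RAG state machine: route -> retrieve or
# web-search -> grade documents -> generate -> hallucination and answer checks,
# with a bounded retry loop. Every node is a stub standing in for the real
# LangChain/LangGraph component.

MAX_RETRIES = 2

def route(question):                 # router: vector store vs. web search
    return "vectorstore" if "agent" in question else "websearch"

def retrieve(question):              return ["doc about agents"]
def web_search(question):            return ["web result"]
def grade_documents(question, docs): return bool(docs)      # relevance grader
def generate(question, docs):        return f"answer based on {docs[0]}"
def is_grounded(answer, docs):       return True            # hallucination grader
def answers_question(question, a):   return True            # answer grader

def run_pipeline(question):
    docs = retrieve(question) if route(question) == "vectorstore" else web_search(question)
    if not grade_documents(question, docs):
        docs = web_search(question)                         # fallback path
    for _ in range(MAX_RETRIES + 1):
        answer = generate(question, docs)
        if is_grounded(answer, docs) and answers_question(question, answer):
            return answer                                   # conditional exit
    return "not supported"                                  # retries exhausted

print(run_pipeline("what is an agent?"))
```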

下图展示了一个多阶段的 RAG 工作流,该工作流集成了路由逻辑,用于确定查询解析的最佳路径。此架构结合了从矢量存储库中检索传统内容、对文档进行相关性分级以及可选的备用策略(例如网络搜索)。如果检索到的内容不足或无用,系统将启动一个生成阶段,并根据其效用情况进行重试和条件退出。这种支持路由的设计确保了对各种输入类型和检索失败情况的稳健处理,从而在实际部署中提升了准确性和适应性。

The following figure illustrates a multi-stage RAG workflow that incorporates routing logic to determine the most appropriate path for query resolution. This architecture combines traditional retrieval from a vector store, document grading for relevance, and optional fallback strategies such as web search. If the retrieved content is found to be insufficient or not useful, the system invokes a generation phase with retries and conditional exits based on utility. Such a routing-enabled design ensures robust handling of diverse input types and retrieval failures, supporting enhanced accuracy and adaptability in real-world deployments.

流程图显示了一个从 _start_ 开始,分支到 linkedin 或 vectorstore,然后执行诸如 retrieve、grade_documents、websearch、generate 等步骤,最后以 __end__ 或 not supported 结束的过程。
The flowchart shows a process that starts at _start_, branches to linkedin or vectorstore, proceeds through steps such as retrieve, grade_documents, websearch, and generate, and ends at __end__ or not supported.

图 6.6:多阶段 RAG 工作流程及路由

Figure 6.6: Multi-stage RAG workflow with routing

结论

Conclusion

本章深入探讨了RAG系统的发展现状,重点阐述了从基本的两阶段模型向更复杂的多阶段架构的转变。我们探讨了密集检索交互如何支撑相关性匹配,以及评分机制如何通过评估事实性、相关性和完整性来增强响应的可信度。通过实现具有智能路由的多阶段RAG工作流程,我们展示了如何根据问题类型和内容质量动态选择检索源和生成路径。这种模块化和自适应设计为在实际应用中构建可扩展、可靠且具有上下文感知能力的GenAI系统铺平了道路。

In this chapter, we delved into the evolving landscape of RAG systems, emphasizing the shift from basic two-stage models to more sophisticated multi-stage architectures. We explored how dense retrieval interactions underpin relevance matching and how grading mechanisms enhance the trustworthiness of responses by assessing factuality, relevance, and completeness. Through the implementation of a multi-stage RAG workflow with intelligent routing, we demonstrated how retrieval sources and generation pathways can be dynamically selected based on question type and content quality. This modular and adaptive design paves the way for scalable, reliable, and context-aware GenAI systems in real-world applications.

下一章,我们将实现一个多模态检索系统,重点关注检索组件。

In the next chapter, we will implement a multimodal retrieval system, focusing exclusively on the retrieval component.

第七章 构建双向多模态检索系统

CHAPTER 7Building a Bidirectional Multimodal Retrieval System

介绍

Introduction

在日益视觉化和互联互通的数字世界中,跨不同模态(例如文本和图像)搜索和检索信息的能力已成为高级人工智能( AI ) 应用的基石。本章将介绍多模态检索的概念,即系统旨在理解和关联文本和视觉输入。与仅依赖文本相似性的传统搜索引擎不同,多模态系统利用图像和文本的矢量表示来提供更丰富、更具上下文关联性的搜索结果。您将学习如何构建这样一个系统:集成 Qdrant 作为矢量数据库,使用Hugging Face的对比语言-图像预训练( CLIP )模型生成图像嵌入,并使用 LangChain 来协调检索过程。这些工具支持对多种数据格式的统一访问,使用户能够执行灵活的跨模态搜索,例如从图像中检索描述或识别与文本输入匹配的图像。

In an increasingly visual and interconnected digital world, the ability to search and retrieve information across different modalities, such as text and images, has become a cornerstone of advanced artificial intelligence (AI) applications. This chapter introduces the concept of multimodal retrieval, where systems are designed to understand and correlate both textual and visual inputs. Unlike traditional search engines that rely solely on textual similarity, multimodal systems use vector representations from both images and text to deliver richer, more contextually aligned results. You will learn how to build such a system by integrating Qdrant as a vector database, Contrastive Language-Image pre-Training (CLIP) models from Hugging Face for generating image embeddings, and LangChain to orchestrate the retrieval process. These tools enable unified access to multiple data formats, allowing users to perform flexible cross-modal searches, such as retrieving descriptions from images or identifying images that match textual inputs.

本章将指导您构建双索引向量存储,并开发能够处理各种查询格式的混合检索器。基于 Python 的实现将引导您完成索引工作流、嵌入管道以及在不同模态之间无缝切换的检索逻辑。除了技术架构之外,本章还将深入探讨一些实用的设计决策,例如相似度评分、模态优先级排序和自定义检索逻辑。最终,您将掌握部署可用于生产环境的多模态检索器的技能,这为电子商务推荐、视觉内容发现和语义搜索引擎等用例奠定了坚实的基础。这种实践方法不仅能确保您理解理论,还能让您获得实施可扩展的实际解决方案的能力。

Throughout the chapter, you will construct dual-index vector stores and develop hybrid retrievers capable of handling diverse query formats. Python-based implementations will guide you through indexing workflows, embedding pipelines, and retrieval logic that switches seamlessly between modalities. Beyond technical architecture, the chapter delves into practical design decisions like similarity scoring, modality prioritization, and custom retrieval logic. By the end, you will have the skills to deploy a production-ready multimodal retriever, a foundation applicable to use cases in e-commerce recommendations, visual content discovery, and semantic search engines. This hands-on approach ensures you not only understand the theory but also gain the ability to implement scalable, real-world solutions.

结构

Structure

本章我们将学习以下主题:

In this chapter, we will learn about the following topics:

  • 基于输出的多模态系统分类
  • Output-based classification of multimodal systems
  • 理解多模态检索系统
  • Understanding a multimodal retrieval system
  • 代码实现及说明
  • Code implementation and explanation
  • 留给读者的练习
  • To do for the readers

目标

Objectives

本章的目标是设计并实现一个能够处理文本和图像输入的多模态检索系统。读者将学习如何使用双编码器对来自多种模态的数据进行预处理和嵌入,如何规范化向量表示,以及如何高效地将其存储在诸如 Qdrant 之类的向量数据库中。该系统支持跨模态查询,例如从文本提示中检索图像,以及从图像输入中检索文本内容,从而实现跨异构数据类型的语义搜索。本章为构建智能的、模态感知型应用程序奠定了技术基础,并为读者在后续章节中通过集成生成模型来进一步扩展系统做好了准备。

The objective of this chapter is to design and implement a multimodal retrieval system capable of handling both text and image inputs. Readers will learn how to preprocess and embed data from multiple modalities using bi-encoders, normalize vector representations, and store them efficiently in a vector database such as Qdrant. The system supports cross-modal queries such as retrieving images from text prompts and textual content from image inputs, enabling semantic search across heterogeneous data types. This chapter lays the technical foundation for building intelligent, modality-aware applications and prepares readers to extend the system further by incorporating generative models in the subsequent chapter.

基于输出的多模态系统分类

Output-based classification of multimodal systems

本节以第二章“深入探索多模态系统”中介绍的基础概念为基础,概述了多模态系统的四种关键输出类型:文本到图像、图像到文本、文本和图像到图像以及文本到规范和图像。这些类别定义了如何使用不同的输入模态组合来生成特定的输出格式,构成了现代多模态人工智能应用的基础。通过基于输出类型对系统进行分类,我们创建了一个更清晰的框架,用于理解从图像生成模型到图像描述工具和规范驱动的设计引擎等各种技术如何在实际场景中发挥作用。这种分类不仅强化了前面提到的区别,而且还提供了一个结构化的视角,通过这个视角可以更好地理解和解决后续章节中提到的检索和生成方面的挑战。以下是简要回顾:

Building upon the foundational concepts introduced in Chapter 2, Deep Dive into Multimodal Systems, this section offers an overview of the four key output-based classifications of multimodal systems that are text-to-image, image-to-text, text and image-to-image, and text to specifications and image. These categories define how different combinations of input modalities are used to produce specific output formats, forming the backbone of modern multimodal AI applications. By organizing systems based on their output types, we create a clearer framework for understanding how diverse technologies, from image generation models to captioning tools and specification-driven design engines, function in real-world scenarios. This classification not only reinforces the distinctions made earlier but also provides a structured lens through which the retrieval and generation challenges in upcoming chapters can be better understood and implemented. A quick recap is as follows:

  • 文本到图像:这类系统接收文本提示并生成相应的图像。它们通常包含一个文本编码器,用于将提示转换为潜在表示,然后是一个生成模型,例如扩散模型或生成模型。对抗网络GAN )是一种将这种表示解码为视觉内容的网络。DALL·E、Imagen 和 Stable Diffusion 等模型都属于此类。在学术界,文本到图像的系统是跨模态生成建模的范例,它将语言映射到视觉图像,并广泛用于语义图像创建和创意内容任务。
  • Text-to-image: These systems take a textual prompt and generate a corresponding image. They typically consist of a text encoder that converts the prompt into a latent representation, followed by a generative model like a diffusion or generative adversarial network (GAN) based network that decodes this representation into visual content. Models like DALL·E, Imagen, and Stable Diffusion fall into this category. Academically, text-to-image systems are examples of cross-modal generative modeling, mapping language to visuals, and are used extensively for semantic image creation and creative content tasks.
  • 图像转文本:该系统处理输入图像并生成自然语言描述、标题或元数据。这涉及使用视觉骨干网络(例如卷积神经网络( CNN ) 或视觉转换器( ViT ))对图像进行编码,然后使用语言模型将其解码为文本。应用包括图像描述和视觉问答,输出内容从简短的描述(例如“一只狗在玩耍”)到更具互动性的问答回复不等。这些系统使机器能够以语言形式解释和表达视觉信息。
  • Image-to-text: Here, the system processes an input image and produces a natural language description, caption, or metadata. This involves encoding the image with a visual backbone, such as a convolutional neural network (CNN) or a Vision Transformer (ViT), and decoding it into text using a language model. Applications include image captioning and visual question answering, where the output ranges from short captions like "a dog playing" to more interactive question and answer responses. These systems enable machines to interpret and articulate visual information in language form.
  • 文本和图像到图像:这些混合系统接受文本和图像作为输入,并输出修改或增强后的图像。这包括条件图像转换等任务,例如用户提供草图和“使其逼真”之类的提示,或者提供照片以及用于调整样式的文本说明。通过融合来自两种模态的嵌入信息并将其输入到条件图像生成器中,这些系统提供了语义视觉编辑功能,可用于创意设计和风格转换。
  • Text and image-to-image: These hybrid systems accept both text and image as input and output a modified or enhanced image. This includes tasks like conditional image translation, where a user might provide a sketch and a prompt like "make it photorealistic", or a photo accompanied by text instructions for style adjustments. By merging embeddings from both modalities and feeding them into a conditional image generator, these systems offer semantic visual editing capabilities, useful for creative design and style transformation.
  • 文本转规格和图像:最先进的类别将文本输入与结构化规格和视觉渲染的生成相结合。例如,给定“设计一把具有特定尺寸的椅子”这样的提示,系统会输出规格表(例如,尺寸表)和椅子的视觉表示。这些系统将自然语言处理( NLP ) 与符号推理相结合,生成机器可读的规格和图像,非常适合产品设计、建筑规划和电子商务等需要精确视觉和结构输出的应用。
  • Text to specs and image: The most advanced category combines textual input with the generation of structured specifications and visual renderings. For instance, given a prompt like "design a chair with specific dimensions", the system outputs both a specification sheet (e.g., dimension tables) and a visual representation of the chair. These systems integrate natural language processing (NLP) with symbolic reasoning to produce both machine-readable specs and images, ideal for applications in product design, architectural planning, and e-commerce, where precise visual and structural output is needed.

集成和设计影响

Integration and design implications

从理论角度来看,多模态系统涵盖了转换、对齐和融合三种范式。文本到图像和图像到文本系统主要关注模态之间的转换。文本到图像和图像到图像类展示了在图像生成之前对组合的多模态嵌入进行融合的过程。最后,文本到规范和图像类融合了转换(文本到规范)、结构生成(规范到图像)和融合,能够处理符号和视觉输出。

Viewed through a theoretical lens, multimodal systems span translation, alignment, and fusion paradigms. Text-to-image and image-to-text systems primarily focus on translating between modalities. The text and image-to-image class demonstrates the fusion of combined multimodal embeddings before image generation. Lastly, the text to specifications and image category blends translation (text to specs), structure generation (specs to image), and fusion, handling both symbolic and visual outputs.

识别这些类别对于设计多模态检索系统至关重要,例如第六章“两阶段和多阶段 GenAI 系统”中讨论的混合检索器,其中索引、查询和检索必须适应各种不同的输入/输出模态。这种分类决定了我们如何构建向量存储、制定嵌入策略以及定义跨模态搜索能力,以完成诸如“查找符合规范的图像”或“从图像中检索规范”之类的任务。

Recognizing these categories is crucial for designing multimodal retrieval systems, such as the hybrid retrievers discussed in Chapter 6, Two and Multi-stage GenAI Systems, where indexing, querying, and retrieval must accommodate diverse input/output modalities. Such classification informs how we build vector stores, craft embedding strategies, and define cross-modal search capabilities for tasks like finding images matching a specification or retrieving specifications from an image.

理解多模态检索系统

Understanding a multimodal retrieval system

图 7.1展示了一个多模态检索系统架构,其中检索过程可在文本和图像模态之间无缝运行。从学术角度来看,这种方法利用了嵌入(一种基于向量的表示方法,用于捕捉数据中的语义关系),并分别针对文本和图像内容生成嵌入。

Figure 7.1 illustrates a multimodal retrieval system architecture, where retrieval operates seamlessly across text and image modalities. Academically, this approach leverages embeddings, a vector-based representation capturing semantic relationships within data, generated separately for textual and visual content.

该过程始于用户查询,查询内容可以包含文本、图像或两者兼有。这些输入会被传递给专门的嵌入模型:文本嵌入模型用于处理文本查询或文档,图像嵌入模型用于处理视觉输入。文档会被分块成更小的单元,以提高检索的粒度和效率,而图像则直接嵌入到向量表示中。

The process initiates with user queries, which may consist of text, images, or both. These inputs are passed to specialized embedding models: a text embedding model for textual queries or documents, and an image embedding model for visual inputs. Documents undergo chunking into smaller units to improve the granularity and efficiency of retrieval, whereas images are directly embedded into the vector representation.

词嵌入计算完成后,会被存储在一个多模态向量数据库中,该数据库旨在处理混合数据类型。系统接收到查询后,会在该数据库中执行向量相似性搜索,并基于语义接近性而非精确匹配来检索结果。最终返回的结果(包含文本块和图像)会呈现给用户。

Once embeddings are computed, they are stored in a multimodal vector database designed to handle mixed data types. Upon receiving a query, the system performs vector similarity searches across this database, retrieving results based on semantic proximity rather than exact matches. The returned results, combining textual chunks and images, are then provided back to the user.
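A toy illustration of this similarity search, using made-up 3-dimensional vectors in place of real embeddings (production systems use hundreds of dimensions and an ANN index rather than a linear scan):

```python
# Mixed store of text-chunk and image embeddings, ranked by cosine similarity:
# results are returned by semantic proximity, not exact match, and both
# modalities compete in the same search. Vectors here are illustrative only.
import math

store = [
    {"id": "chunk-1", "modality": "text",  "vector": [0.9, 0.1, 0.0]},
    {"id": "img-1",   "modality": "image", "vector": [0.8, 0.2, 0.1]},
    {"id": "chunk-2", "modality": "text",  "vector": [0.0, 1.0, 0.0]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vector, k=2):
    ranked = sorted(store, key=lambda p: cosine(query_vector, p["vector"]),
                    reverse=True)
    return [p["id"] for p in ranked[:k]]

# A query vector close to chunk-1 also surfaces the semantically nearby image.
print(search([1.0, 0.0, 0.0]))   # ['chunk-1', 'img-1']
```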

在各种应用场景中,这种多模态检索系统是跨模态搜索、基于内容的图像检索和集成语义推荐系统等高级应用的基础。文本和图像嵌入的结合使用提高了检索的准确性和上下文相关性,从而支持更丰富、更直观的用户交互。

In applied contexts, such multimodal retrieval systems are foundational for advanced applications like cross-modal search, content-based image retrieval, and integrated semantic recommendation systems. The combined use of text and image embeddings enhances the accuracy and contextual relevance of retrieval, supporting richer, more intuitive user interactions.

技术架构

Technical architecture

多模态检索系统的出现标志着信息检索领域的一项重大进步,它使系统能够处理和语义对齐不同数据模态(例如文本和图像)的内容。本文讨论的架构示意图展示了一个强大的框架,该框架将文本和图像嵌入管道与共享或协调的向量空间集成在一起,以实现高精度的跨模态搜索。本节将全面阐述支撑该架构的技术组件、数据流机制和系统设计原则。

The advent of multimodal retrieval systems marks a pivotal advancement in the field of information retrieval, enabling systems to process and semantically align content across distinct data modalities such as text and images. The architectural schematic under discussion illustrates a robust framework that integrates text and image embedding pipelines with a shared or coordinated vector space for high-precision, cross-modal search. This section provides a comprehensive exposition of the technical components, data flow mechanisms, and system design principles underpinning such architectures.

下图展示了一个多模态检索系统的架构,该系统整合了文本和图像数据,以实现统一的查询处理。用户查询可能包含文本和/或图像,并使用针对每种模态定制的独立嵌入模型进行编码。这些嵌入存储在一个多模态向量数据库中,该数据库支持跨文档块和图像的联合检索。查询发生时,系统执行向量相似性搜索,并返回语义一致的两种数据类型的结果,从而实现稳健且上下文丰富的响应生成。

The following figure presents the architecture of a multimodal retrieval system that integrates both textual and visual data for unified query processing. User queries, which may include text and/or images, are encoded using separate embedding models tailored to each modality. These embeddings are stored in a multimodal vector database that supports joint retrieval across document chunks and images. Upon querying, the system performs vector similarity search and returns semantically aligned results from both data types, thereby enabling robust and context-rich response generation.

该流程图展示了使用文本和图像嵌入模型进行查询处理的流程。文档和图像被分块、嵌入,存储在矢量数据库中,并通过多模态矢量搜索进行检索,最终将结果返回给用户。
The flowchart shows query processing with text and image embedding models. Documents and images are chunked, embedded, stored in a vector database, and retrieved via multimodal vector search, with results returned to the user.

图 7.1:基础多模态检索系统

Figure 7.1: A foundation multimodal retrieval system

该系统的架构依赖于多模态处理,用户查询可以来自文本、图像或二者的组合。为了有效处理这种情况,该流程采用了专门的嵌入模型,将这些输入统一到一个共享的语义空间中。关键组件概述如下

The system’s architecture relies on multimodal processing, where user queries can originate from text, images, or a combination of both. To handle this effectively, the pipeline employs specialized embedding models that unify these inputs into a shared semantic space. The key components are outlined in the following list:

  • 用户交互与查询接收:系统通过用户查询启动,查询形式可以是文本输入(例如描述或问题)、视觉输入(例如产品图片)或二者的混合。这些多模态查询反映了现实世界的信息检索行为,因此需要将其转换为一种能够保留语义意图的通用表示格式。这可以通过嵌入模型来实现,该模型将原始输入编码为潜在语义空间内的密集向量表示。
  • User interaction and query intake: The system is initiated through user queries, which may take the form of textual inputs (e.g., descriptions or questions), visual inputs (e.g., product images), or hybrid combinations of both. These multimodal queries reflect real-world information-seeking behaviors and necessitate translation into a common representational format that preserves semantic intent. This is achieved through embedding models that encode raw input into dense vector representations within a latent semantic space.
  • 文本和图像模态的嵌入模型:该架构的核心是针对每种模态量身定制的专用嵌入模型:
    • 文本嵌入模型:将自然语言内容转换为高维向量。这些模型通常使用最先进的Transformer架构来实现,例如Sentence-BERT、text-embedding-ada或类似的大规模语言模型。它们能够捕捉句法结构和语义上下文,从而实现句子或段落级别的细粒度检索。
    • 图像嵌入模型:使用预训练的视觉语言模型VLM )将视觉内容转换为潜在向量表示,例如:CLIP。这些模型经过训练,可以在共享的语义空间中对齐视觉和文本表示,从而实现直接的跨模态比较。

    由此产生的嵌入提供了与模态无关的编码,有助于在异构数据类型之间进行高效的相似性搜索。

  • Embedding models for text and image modalities: At the core of the architecture are dedicated embedding models tailored for each modality:
    • Text embedding model: Converts natural language content into high-dimensional vectors. These models are typically instantiated using state-of-the-art transformer architectures such as Sentence-BERT, text-embedding-ada, or similar large-scale language models. They capture syntactic structure and semantic context, enabling fine-grained retrieval at the sentence or paragraph level.
    • Image embedding model: Transforms visual content into latent vector representations using pre-trained vision-language models (VLMs) such as CLIP. These models are trained to align visual and textual representations in a shared semantic space, allowing direct cross-modal comparison.

    The resulting embeddings provide modality-agnostic encodings that facilitate efficient similarity search across heterogeneous data types.
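The "shared semantic space" property rests on a simple piece of arithmetic: once every embedding is L2-normalized to unit length, cosine similarity reduces to a dot product, so text and image vectors become directly comparable. A minimal sketch with stand-in vectors:

```python
# L2-normalization makes cosine similarity a plain dot product, which is what
# allows text and image embeddings to live in one comparable vector space.
# The two vectors below are stand-ins for real CLIP-style embeddings.
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

text_vec  = l2_normalize([3.0, 4.0])   # pretend text embedding
image_vec = l2_normalize([4.0, 3.0])   # pretend image embedding

# Unit length after normalization; dot product now equals cosine similarity.
similarity = dot(text_vec, image_vec)
print(round(similarity, 4))   # 0.96
```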

  • 文档分块和嵌入准备:为了确保与模型标记限制兼容并提高检索粒度,冗长的文本文档会被分割成更小、更连贯的单元,称为块(chunk)。每个块都使用文本编码器进行单独嵌入。类似地,图像(无论是独立的还是从文档中提取的)都通过视觉编码器进行嵌入。此阶段的输出是一组结构化的矢量化文本和图像片段,每个片段都链接到相关的元数据,例如来源标识符、在文档中的位置和语义标签。
  • Document chunking and embedding preparation: To ensure compatibility with model token limits and to enhance retrieval granularity, lengthy textual documents are segmented into smaller, coherent units known as chunks. Each chunk is individually embedded using the text encoder. Similarly, images, either standalone or extracted from documents, are embedded through the visual encoder. The output of this stage is a structured set of vectorized text and image segments, each linked with relevant metadata such as source identifier, location within the document, and semantic tags.
  • 多模态向量存储集成:嵌入式表示存储在向量数据库中,该数据库旨在支持多模态索引和搜索操作。此类系统的示例包括 Qdrant、Weaviate 和 Pinecone,它们都通过诸如分层可导航小世界( HNSW ) 图之类的索引算法提供高性能的近似最近邻( ANN ) 搜索

    向量存储必须支持:

    • 统一索引:一个索引可以容纳来自两种模态的嵌入。
    • 元数据过滤:基于有效载荷元数据(例如,时间戳、类别)的结构化过滤。
    • 相似度指标:可配置的评分策略,例如余弦相似度或点积,用于衡量查询和存储的嵌入之间的语义接近程度。

    该架构可以选择性地采用双索引结构,其中文本和图像嵌入分别存储和查询,并在后处理期间应用融合逻辑。

  • Multimodal vector store integration: The embedded representations are stored in a vector database designed to support multimodal indexing and search operations. Examples of such systems include Qdrant, Weaviate, and Pinecone, all of which offer high-performance approximate nearest neighbor (ANN) search via indexing algorithms such as hierarchical navigable small world (HNSW) graphs.

    The vector store must support:

    • Unified indexing: A single index accommodating embeddings from both modalities.
    • Metadata filtering: Structured filtering based on payload metadata (e.g., timestamps, categories).
    • Similarity metrics: Configurable scoring strategies, such as cosine similarity or dot product, to measure semantic proximity between query and stored embeddings.

    This architecture may optionally employ dual-index structures, where text and image embeddings are stored and queried separately, with fusion logic applied during post-processing.
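The three requirements above (unified indexing, metadata filtering, and configurable similarity metrics) can be sketched in a few lines of plain Python before being delegated to a real vector database such as Qdrant. Everything here, the toy index, the payloads, and the metrics table, is illustrative:

```python
# Toy version of the vector-store contract: one index holding both modalities,
# structured payload filtering, and a configurable similarity metric. A real
# deployment would hand this to Qdrant, Weaviate, or Pinecone with HNSW-based
# approximate nearest neighbor search.
import math

index = [
    {"vector": [1.0, 0.0], "payload": {"modality": "image", "category": "chair"}},
    {"vector": [0.7, 0.7], "payload": {"modality": "text",  "category": "chair"}},
    {"vector": [0.0, 1.0], "payload": {"modality": "text",  "category": "table"}},
]

METRICS = {
    "dot": lambda a, b: sum(x * y for x, y in zip(a, b)),
    "cosine": lambda a, b: (sum(x * y for x, y in zip(a, b))
                            / (math.hypot(*a) * math.hypot(*b))),
}

def query(vector, metric="cosine", **filters):
    # structured filtering on payload metadata, then similarity ranking
    hits = [p for p in index
            if all(p["payload"].get(k) == v for k, v in filters.items())]
    return sorted(hits, key=lambda p: METRICS[metric](vector, p["vector"]),
                  reverse=True)

top = query([1.0, 0.2], metric="cosine", category="chair")
print(top[0]["payload"]["modality"])   # image
```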

  • 查询编码和相似度搜索:系统接收到用户查询后,动态确定输入模态并应用相应的嵌入模型。对于混合查询,系统会同时为文本和图像组件生成嵌入。然后,将查询向量提交到向量存储库,以根据向量邻近度检索最相似的前 k 个嵌入。高级实现可能包含以下内容:
    • 后期融合策略:合并来自不同模态指标的排名结果。
    • 分数归一化:使不同嵌入分布之间的相似性分数保持一致。
    • 跨模态重排序:使用双编码器或交叉编码器架构来改进初始搜索结果。

    此阶段会生成一个检索向量的排名列表,每个向量对应于数据库中存储的文本块或图像片段。

  • Query encoding and similarity search: Upon receipt of a user query, the system dynamically determines the input modality and applies the corresponding embedding model. For hybrid queries, embeddings are generated for both text and image components. The query vectors are then submitted to the vector store to retrieve the top-k most similar embeddings based on vector proximity. Advanced implementations may incorporate the following:
    • Late fusion strategies: Combining ranked results from separate modality indexes.
    • Score normalization: Aligning similarity scores across heterogeneous embedding distributions.
    • Cross-modal reranking: Using bi-encoder or cross-encoder architectures to refine initial search results.

    This stage yields a ranked list of retrieved vectors, each corresponding to a text chunk or image segment stored in the database.
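Of these, late fusion with score normalization is the easiest to illustrate. The sketch below min-max-normalizes the raw scores from two hypothetical per-modality indexes so they share a common scale, then merges them into one ranked list; the scores are invented for the example.

```python
# Late fusion sketch: normalize scores from separate text and image indexes
# onto a common 0..1 scale (min-max), then merge into a single ranked list.
# Assumes each index returns at least two distinct raw scores.

text_hits  = {"chunk-1": 12.4, "chunk-2": 9.1}   # raw scores, text index
image_hits = {"img-1": 0.91, "img-2": 0.80}      # raw scores, image index

def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def late_fusion(*score_maps, k=3):
    merged = {}
    for normalized in map(min_max, score_maps):  # align heterogeneous scales
        merged.update(normalized)
    return sorted(merged, key=merged.get, reverse=True)[:k]

fused = late_fusion(text_hits, image_hits)
print(fused)
```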

  • 结果映射和响应生成:检索到的向量标识符会使用存储的元数据映射回其原始内容,无论是文档片段还是图像文件。这些内容会被格式化并呈现给用户。此外,还可以选择调用生成式语言模型(例如,基于 GPT 的模型)来:
    • 总结检索到的内容。
    • 生成自然语言答案。
    • 通过多轮对话提高结果的可解释性。

    最终的呈现层弥合了密集的、基于矢量的内部表示与用户认知期望之间的差距,从而提供可解释的、与上下文相关的结果。

  • Result mapping and response generation: The retrieved vector identifiers are mapped back to their original content, be it document snippets or image files, using stored metadata. These are formatted and presented to the user. Optionally, a generative language model (e.g., GPT-based) may be invoked to:
    • Summarize the retrieved content.
    • Generate a natural language answer.
    • Enhance the interpretability of results through multi-turn dialogue.

    This final presentation layer bridges the gap between the dense, vector-based internal representation and the user's cognitive expectations, delivering explainable and contextually relevant results.

  • 技术增强和设计优化:多模态检索系统可以融入额外的智能层,以优化性能和准确性:
    • 模态路由:一种策略机制,用于检测主要查询模态并将其路由到相应的检索器。
    • 嵌入缓存:通过存储频繁查询的输入的嵌入来减少推理延迟。
    • 检索增强:支持查询扩展或伪相关性反馈等技术。
  • Technical enhancements and design optimizations: Multimodal retrieval systems may incorporate additional layers of intelligence to optimize performance and accuracy:
    • Modality routing: A policy mechanism that detects dominant query modality and routes it to the appropriate retriever.
    • Embedding caching: Reduces inference latency by storing embeddings of frequently queried inputs.
    • Retrieval augmentation: Supports techniques such as query expansion or pseudo-relevance feedback.

These features collectively contribute to the system's scalability, responsiveness, and effectiveness in real-time applications.
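Two of the optimizations above — score normalization with late fusion, and embedding caching — can be sketched in plain Python. This is an illustrative sketch, not the book's implementation: `min_max_normalize`, `late_fusion`, and `make_cached_embedder` are hypothetical helper names, and `embed_fn` stands in for a real embedding-model call.

```python
from functools import lru_cache

def min_max_normalize(scores):
    """Rescale raw similarity scores into [0, 1] so that scores from
    heterogeneous embedding distributions become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def late_fusion(text_hits, image_hits, w_text=0.5, w_image=0.5):
    """Merge two per-modality ranked lists of (doc_id, score) pairs
    into a single ranking over normalized, weighted scores."""
    fused = {}
    for hits, weight in ((text_hits, w_text), (image_hits, w_image)):
        if not hits:
            continue
        norm_scores = min_max_normalize([s for _, s in hits])
        for (doc_id, _), s in zip(hits, norm_scores):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def make_cached_embedder(embed_fn, maxsize=1024):
    """Wrap an embedding function in an in-process LRU cache keyed on the
    query text, avoiding repeated inference for frequent queries."""
    @lru_cache(maxsize=maxsize)
    def cached(text):
        return tuple(embed_fn(text))  # tuples are hashable, hence cacheable
    return cached
```

With these pieces, the text and image indexes can be queried independently and their rankings fused, while repeated queries bypass the embedding model entirely.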

Applications and implications

Such architectures are foundational in a range of AI-driven applications, including but not limited to:

  • Visual product search: Matching product images to descriptions in e-commerce platforms.
  • Medical imaging retrieval: Finding annotated radiology reports based on diagnostic images.
  • Multimodal QA: Retrieving answers based on diagrams and accompanying textual context.
  • Interactive content discovery: Enabling users to search archives, galleries, or databases using descriptive prompts or reference images.

Code implementation and explanation

The multimodal retrieval system detailed above exemplifies the convergence of NLP, computer vision, and vector similarity search in a unified architecture. By enabling seamless cross-modal interaction, it provides a powerful framework for real-time, semantically rich retrieval across complex information landscapes. For data science and generative AI (GenAI) practitioners, mastering the design and implementation of such systems is essential to advancing the state of multimodal AI applications. Let us understand it using a code example. The code is shared as part of this book.

Requirements

The following Python libraries constitute the foundational software stack required to implement a multimodal retrieval system that integrates text and image embeddings, vector indexing, and real-time interaction. Each package has been carefully selected to support key functionalities such as vector representation learning, semantic search, document parsing, and interactive interface design.

  • Streamlit:
    • Purpose: It provides a lightweight and declarative framework for building web-based user interfaces.
    • Usage: Streamlit is employed to create the frontend for multimodal interaction, allowing users to input text or upload images, and receive visual or textual results in real-time.
    • Context: It facilitates rapid prototyping of data-centric applications and is widely used in research for the demonstration of machine learning models.
  • QdrantClient:
    • Purpose: A Python client for interacting with the Qdrant vector database.
    • Usage: It enables operations such as inserting, querying, and managing embeddings stored as high-dimensional vectors. Qdrant supports efficient ANN search.
    • Context: It is suitable for storing multimodal vector embeddings due to its support for payloads, filtering, and multiple collection indexes.
  • LlamaIndex:
    • Purpose: A framework for building retrieval-augmented generation (RAG) systems.
    • Usage: It facilitates indexing, chunking, and document retrieval in combination with language models. Integrates seamlessly with Qdrant and other vector databases.
    • Context: It enables the construction of scalable RAG pipelines that combine retrieval with generative reasoning for tasks such as open-domain QA and document synthesis.
  • LangChain:
    • Purpose: A powerful orchestration library for chaining together large language model (LLM) calls, retrieval mechanisms, tools, and user prompts.
    • Usage: It manages embedding generation, document retrieval, and LLM-driven post-processing in the multimodal pipeline.
    • Context: It is widely used in LLM-centric systems research for constructing agentic workflows and LLM-powered assistants.
  • LangChain Community:
    • Purpose: An extension of the core LangChain library, offering community-contributed integrations and tools.
    • Usage: It supports connectors to third-party embedding models, retrievers, and document loaders that may not be in the core package.
    • Context: It encourages reproducibility and modular experimentation by enabling access to a broad range of open-source data and model utilities.
  • LangChain and Nomic:
    • Purpose: Integration layer between LangChain and Nomic AI's embedding and indexing tools.
    • Usage: It may be used to experiment with alternative embedding backends or advanced visualization of embedding spaces (e.g., via Nomic Atlas).
    • Context: It provides additional flexibility for multimodal experimentation and indexing strategies.
  • Sentence Transformers:
    • Purpose: A library for generating high-quality text embeddings using models like all-MiniLM, multi-qa-mpnet, etc.
    • Usage: It converts user queries and textual chunks into dense vector representations for similarity-based retrieval.
    • Context: It is considered a standard toolkit for semantic textual similarity, question answering, and document clustering tasks.
  • Transformers:
    • Purpose: Hugging Face's flagship library for pre-trained transformer models (e.g., BERT, CLIP, GPT, ViT).
    • Usage: It powers both text and image embedding models, such as CLIP for image-text alignment or BERT for textual embeddings.
    • Context: It is central to most modern NLP, vision, and multimodal AI research and experimentation.
  • Pillow:
    • Purpose: A Python imaging library for image processing and file handling.
    • Usage: It handles uploaded image files, conversion, resizing, and preprocessing prior to embedding.
    • Context: It is essential for integrating visual inputs in multimodal systems and preparing them for inference in image encoders.
  • pypdf:
    • Purpose: PDF parsing library for extracting text and metadata from document files.
    • Usage: Enables ingestion of PDF-based knowledge sources (e.g., research papers, manuals) into the RAG pipeline.
    • Context: Supports document understanding and content indexing, commonly used in automated summarization and search engines.
  • scikit-learn:
    • Purpose: A standard machine learning library for classical algorithms and utilities.
    • Usage: Supports clustering, dimensionality reduction (e.g., principal component analysis (PCA)), and evaluation metrics (e.g., cosine distance) in embedding space analysis.
    • Context: Integral to baseline experimentation, preprocessing pipelines, and feature engineering workflows in ML studies.
  • NumPy:
    • Purpose: Fundamental package for numerical computing in Python.
    • Usage: Underpins all mathematical operations related to vectors, matrices, similarity scoring, and array manipulation.
    • Context: The backbone of numerical computation across AI and data science research, ensuring reproducibility and precision.
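The stack above can be installed from PyPI. The package names below follow current PyPI conventions and are an assumption that may vary by release; in recent llama-index versions, for instance, the Hugging Face embedding integration ships as the separate llama-index-embeddings-huggingface package.

```shell
pip install streamlit qdrant-client llama-index langchain langchain-community \
    langchain-nomic sentence-transformers transformers pillow pypdf \
    scikit-learn numpy
```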

Frontend

The following code provides a practical and extensible implementation of a bidirectional multimodal search system, integrating embedding-based semantic understanding with a scalable vector store backend. Its modular design allows for the straightforward extension of additional modalities (e.g., audio, tabular data) and advanced features such as cross-modal reranking, hybrid retrieval, or user feedback loops. It serves as a canonical example of operationalizing multimodal embeddings in real-time applications using lightweight web frameworks and composable AI components.

This Streamlit-based application presents a lightweight user interface for performing bidirectional multimodal retrieval, enabling users to search from text-to-image and from image-to-text. The implementation integrates vector embedding, similarity search, and payload-based retrieval, and provides a clear example of how multimodal embeddings can be operationalized through a modern, interactive interface.

The following section breaks down and outlines the key functional components of the multimodal retrieval application, covering initialization, interface design, and query processing for both text-to-image and image-to-text pathways. Each step is crucial in enabling seamless interaction between user inputs, embedding generation, and vector-based semantic search.

  • Environment setup and path configuration: The code begins by importing core modules: streamlit for interface design, PIL.Image for image handling, and os/sys for file system and path manipulations.

    import streamlit as st
    from PIL import Image
    import sys
    import os

    The following snippet determines the absolute path of the root directory and appends it to the system path to allow cross-module imports:

    ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
    if ROOT_DIR not in sys.path:
        sys.path.append(ROOT_DIR)

    This ensures that submodules within the project hierarchy (e.g., rag.index_builder) can be accessed without import errors, promoting modular software design.

  • Index initialization via caching:

    from rag.index_builder import build_vectorstores, TEXT_COLLECTION, IMAGE_COLLECTION

    @st.cache_resource(show_spinner="Loading vector index...")
    def init_system():
        return build_vectorstores()

    client, mm_embed = init_system()

    Here, build_vectorstores() is invoked to initialize the multimodal vector index. The function is decorated with @st.cache_resource, which caches the result to avoid reinitializing embeddings or loading vector data on every page refresh. This is particularly important for high-latency operations, such as loading large embedding models or querying vector databases.

    • client: The object responsible for interfacing with the vector database (e.g., Qdrant).
    • mm_embed: A custom utility that provides modality-specific embedding generation, i.e., get_text_embedding() and get_image_embedding().
    • The constants TEXT_COLLECTION and IMAGE_COLLECTION specify the target vector index or collection for query execution.
  • User interface and mode selection:

    st.title(" 🔍 Multimodal Search Demo (Text ↔ Image)")
    option = st.radio("Choose your query type:", ["Text → Image", "Image → Text"])

    The interface begins with a title, followed by a radio button selection for two modes of interaction:

    • Text → Image: The user enters a textual query and receives a matching image.
    • Image → Text: The user uploads an image and receives relevant textual content.

This conditional branching drives the remainder of the application flow.

  • Text-to-image retrieval pathway:

    if option == "Text → Image":
        query = st.text_input("Enter a text prompt to retrieve relevant image:")
        if query:
            st.write(f"Searching for image similar to: *{query}*")
            q_vec = mm_embed.get_text_embedding(query)

    Once the user submits a text prompt, it is embedded into a high-dimensional vector using the get_text_embedding() method. This vector representation captures the semantic intent of the input query.

    res = client.query_points(
        collection_name=IMAGE_COLLECTION,
        query=q_vec,
        using="image",
        with_payload=["image"],
        limit=1,
    )

    The embedded query vector is submitted to the vector database (IMAGE_COLLECTION) using a semantic similarity search. Only the top-1 match is retrieved. The parameter with_payload=["image"] indicates that the associated image filename should be returned alongside the vector match.

    if res and res.points:
        image_file = res.points[0].payload["image"]
        st.image(f"data/images/{image_file}", caption="Top Match", use_column_width=True)
    else:
        st.warning("No image match found.")

    If a result is returned, the payload is used to locate and render the matching image. If no semantically close match exists in the vector store, the user is notified accordingly.

  • Image-to-text retrieval pathway:

    elif option == "Image → Text":
        uploaded_img = st.file_uploader("Upload an image to find related text", type=["png", "jpg", "jpeg"])

    For the reverse mode, the user uploads an image. The uploaded image is temporarily saved to disk for further processing:

    if uploaded_img:
        with open("temp_input_image.jpg", "wb") as f:
            f.write(uploaded_img.read())
        st.image("temp_input_image.jpg", caption="Uploaded Image", use_column_width=True)

    Once saved, the image is passed to the get_image_embedding() method:

    img_vec = mm_embed.get_image_embedding("temp_input_image.jpg")

    The resulting vector is then used to query the TEXT_COLLECTION:

    res = client.query_points(
        collection_name=TEXT_COLLECTION,
        query=img_vec,
        using="text",
        with_payload=["source"],
        limit=1,
    )

    The vector search retrieves the most semantically relevant text snippet. The payload["source"] contains the retrieved textual content:

    if res and res.points:
        source_text = res.points[0].payload["source"]
        st.success("Top matching text:")
        st.write(source_text)
    else:
        st.warning("No relevant text found.")

    Results are then rendered on the interface. In case no result meets the similarity threshold, a warning message is issued.
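Both pathways delegate ranking to query_points. Conceptually, top-k retrieval over L2-normalized vectors reduces to a dot-product ranking; the hypothetical top_k below stands in for the vector store's ANN search, which computes the same thing at scale without scanning every vector.

```python
def top_k(query_vec, index, k=1):
    """Rank (payload, vector) pairs by dot product with the query.

    With L2-normalized vectors, the dot product equals cosine similarity,
    which is the similarity measure the vector store uses."""
    scored = [
        (payload, sum(q * v for q, v in zip(query_vec, vec)))
        for payload, vec in index
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```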

Data directory

The directory structure shown reflects a clean and modular organization for a multimodal retrieval system. The root folder data contains two subdirectories: documents and images. The documents folder typically stores textual sources (e.g., PDFs, text, or Markdown files) that are later chunked and embedded using text encoders. The images folder contains visual data (e.g., PNG, JPG) to be processed using an image embedding model. This separation supports independent preprocessing and indexing of each modality, facilitating streamlined multimodal embedding, storage, and retrieval workflows in systems built for tasks like search, captioning, or cross-modal question answering.
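The layout described can be summarized as follows:

```text
data/
├── documents/   # textual sources (PDF, .txt, .md) → chunked and text-embedded
└── images/      # visual sources (.png, .jpg) → image-embedded
```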

A screenshot of the file directory showing a main folder named data containing two subfolders: documents and images.

Figure 7.2: The images folder contains visual data and textual data

The retrieval system

Folder structure of the retrieval system: The following directory, which houses the retrieval system and its associated utilities, represents the core module of the RAG system. It contains Python source files that handle different stages of data processing:

  • loaders.py: It is responsible for loading documents and images from the filesystem into memory.
  • embedding_utils.py: It provides utility functions for generating embeddings from text and image inputs using pre-trained models.
  • index_builder.py: It orchestrates the process of creating and populating vector indexes (e.g., in Qdrant) for multimodal search.
  • __init__.py: It marks the folder as a Python package, allowing modular imports.

The __pycache__ directory stores compiled bytecode for performance optimization during execution. This structure reflects good modular design and clear separation of concerns. Let us discuss these in more detail in the following section.

A file explorer view showing the rag folder containing Python source files and a __pycache__ folder holding compiled Python files with the .cpython-311.pyc extension.

Figure 7.3: Folder structure of retrieval system

Loaders

The code defines two utility functions, load_pdfs_and_texts() and load_images(), to ingest and preprocess textual and visual data, respectively. These functions serve as critical components in a multimodal retrieval system, facilitating the creation of semantically aligned embeddings across document and image modalities. The implementation leverages the LangChain framework for structured document handling and text chunking and adopts a principled approach to prepare raw data for embedding and indexing in downstream vector databases. Let us understand the code in detail:

  • Textual data ingestion and chunking: This portion imports key utilities from the LangChain ecosystem:
    • PyPDFLoader: It is a document loader specifically for parsing PDF files into structured text.
    • RecursiveCharacterTextSplitter: It is a robust text chunking utility that preserves semantic boundaries while segmenting long text.
    • Document: A structured data class encapsulating content and metadata.
    • os: Used for filesystem traversal.

    from langchain_community.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.schema import Document
    import os

  • Function: load_pdfs_and_texts(folder_path: str): This function is responsible for traversing a directory, identifying .pdf and .txt files, and processing them into chunked Document objects:

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    • Chunk size: Set to 1000 characters to align with typical token limits of embedding models.
    • Overlap: Set to 200 characters to ensure context continuity across adjacent chunks—crucial for preserving semantic coherence during retrieval.

      for fname in os.listdir(folder_path):

      Each file in the folder is processed conditionally:

      • PDF files: Loaded using PyPDFLoader, which returns structured page-level content. The text is then split into chunks using the recursive splitter.
      • Text files: Read as a whole string, chunked into overlapping segments, and then wrapped into Document objects with the filename stored as metadata.

        Document(page_content=chunk, metadata={"source": fname})

      This metadata enables downstream traceability and source attribution, which are essential in retrieval-based applications where provenance is required.

  • Function: load_images(folder_path: str): This function handles the visual modality by scanning a folder for supported image formats and wrapping file paths into Document instances:

    Document(page_content=os.path.join(folder_path, f), metadata={"image": f})

    • page_content stores the full file path, which will later be passed to an image encoder (e.g., CLIP).
    • metadata contains the image filename, facilitating reverse lookup and UI display post-retrieval.

    This design aligns with LangChain's Document schema, ensuring that both text and image inputs are compatible with a unified document-processing pipeline despite originating from different modalities.

  • Design: These utilities reflect best practices in data preprocessing for vector-based multimodal systems:
    • They maintain structural uniformity through LangChain's Document schema, enabling modality-agnostic handling.
    • The chunking strategy improves retrieval precision by balancing granularity with semantic preservation.
    • Metadata tagging enhances retrievability, traceability, and potential for downstream visualization or summarization.

Such preprocessing routines are foundational to building systems for RAG, semantic search, and cross-modal alignment, where consistent document representations across modalities are critical.
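The chunk_size=1000 / chunk_overlap=200 arithmetic used above can be illustrated with a bare sliding window. This is a simplification: RecursiveCharacterTextSplitter additionally prefers to split at separator boundaries such as paragraphs and sentences, and chunk_text here is a hypothetical helper, not the library's implementation.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Sliding-window chunking: each chunk repeats the final
    `chunk_overlap` characters of its predecessor for context continuity."""
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```

For a 2,500-character document this yields three chunks, with the last 200 characters of each chunk repeated at the start of the next.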

Embedding utils

This code snippet defines a function to initialize a Hugging Face embedding model for multimodal tasks using the llama_index library. It imports HuggingFaceEmbedding from llama_index.embeddings.huggingface and sets the model to openai/clip-vit-base-patch32, a popular CLIP model that jointly embeds text and images in the same semantic space. The function get_mm_embedder() accepts a device argument (e.g., cpu or cuda) and returns an embedding interface that can generate vector representations for both modalities. The trust_remote_code=True flag allows execution of custom code from Hugging Face repositories, enabling more flexible model loading.

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

_MODEL_ID = "openai/clip-vit-base-patch32"

def get_mm_embedder(device: str = "cpu"):
    return HuggingFaceEmbedding(model_name=_MODEL_ID, device=device, trust_remote_code=True)

The CLIP model (openai/clip-vit-base-patch32) used in this code functions as a bi-encoder because it independently encodes text and images into vectors using separate neural network branches. Each modality, text and image, is processed in parallel through its respective encoder (a transformer for text and a ViT for images), without direct interaction during inference. The resulting embeddings are projected into a shared latent space, where similarity (e.g., cosine distance) is computed between the two. This architecture is computationally efficient and enables fast retrieval tasks like image-to-text or text-to-image matching by precomputing embeddings for each modality separately.
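Because the two encoders never interact at inference time, cross-modal scoring reduces to a similarity computation in the shared space. A toy cosine on plain Python lists illustrates the operation (real CLIP projections are 512-dimensional, not 2-dimensional):

```python
def cosine_similarity(u, v):
    """Cosine similarity between two vectors in a shared latent space:
    the dot product divided by the product of the vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal directions score 0.0, regardless of which modality produced each vector.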

Index builder

The index_builder.py function operationalizes a central tenet of multimodal retrieval systems: encoding disparate modalities into semantically aligned vector spaces and storing them for fast, similarity-based search. By separating collection logic by modality and using cosine-normalized embeddings, the implementation adheres to best practices in vector search architecture. The design is modular, easily extensible to additional modalities (e.g., audio or tabular data), and production-ready for RAG, cross-modal search, and intelligent information access. Let us explore index_builder.py further.

The function build_vectorstores() constructs and populates two separate vector collections—one for text and one for images—within a local instance of the Qdrant vector database. This enables fast, similarity-based retrieval in multimodal retrieval systems, where user queries may be textual or visual in nature.

The implementation leverages modular components for document ingestion, embedding generation, normalization, and storage, ensuring that the architecture remains scalable, interpretable, and easily extendable. Let us understand them in detail:

  • Module imports and configuration: These imports enable:
    • Document I/O (Path, loaders)
    • Embedding generation (get_mm_embedder)
    • Vector normalization (numpy.linalg.norm)
    • Qdrant-based vector indexing (QdrantClient, PointStruct)
    • Type safety (List[Document])

      from pathlib import Path
      from typing import List
      from rag.embedding_utils import get_mm_embedder
      from qdrant_client import QdrantClient, models
      from rag.loaders import load_pdfs_and_texts, load_images
      from langchain.schema import Document
      from numpy.linalg import norm

    Constants are defined to point to storage paths and collection names:

    DB_PATH = "data/qdrant_mm"
    TEXT_COLLECTION = "vdr_text"
    IMAGE_COLLECTION = "vdr_images"

  • Vector normalization:

    def normalize(vecs):

    return [v / norm(v) for v in vecs]

    This function performs L2 normalization on all vectors to ensure unit-length embeddings. This is essential when using cosine similarity, which depends solely on vector direction rather than magnitude. Normalization guarantees consistent distance calculations in the high-dimensional embedding space.
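As a runnable sketch, the normalization step can be checked end to end with NumPy (the example vectors are illustrative):

```python
# Sketch of the normalize() helper: L2-normalizes each vector so that
# cosine similarity reduces to a plain dot product on unit-length vectors.
import numpy as np
from numpy.linalg import norm

def normalize(vecs):
    # Divide each vector by its Euclidean (L2) norm to get unit length.
    return [v / norm(v) for v in vecs]

vecs = [np.array([3.0, 4.0]), np.array([1.0, 1.0])]
unit = normalize(vecs)
print([round(float(norm(v)), 6) for v in unit])  # every vector now has norm 1.0
```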

  • Document loading and embedding generation:

    text_docs: List[Document] = load_pdfs_and_texts("data/documents")

    image_docs: List[Document] = load_images("data/images")

    Text and image files are ingested from separate directories using pre-defined loaders that wrap each entry into LangChain-style Document objects, preserving content and metadata.

embedder = get_mm_embedder()

The multimodal embedder (typically CLIP-based) is initialized. This embedder provides:

  • get_text_embedding_batch(): For encoding multiple text chunks
  • get_image_embedding_batch(): For encoding file paths or image tensors

    text_vecs = normalize(embedder.get_text_embedding_batch([...]))

    image_vecs = normalize(embedder.get_image_embedding_batch([...]))

Each modality’s embeddings are normalized and prepared for insertion into Qdrant.

  • QdrantClient initialization and collection creation:

    client = QdrantClient(path=DB_PATH)

    A local Qdrant instance is initialized with persistent storage located at data/qdrant_mm.

    if not client.collection_exists(...):

    client.create_collection(...)

    Two separate vector collections are created if they do not exist already:

    • TEXT_COLLECTION: For textual chunk embeddings
    • IMAGE_COLLECTION: For visual embeddings

      Both use cosine similarity as the distance metric, and the dimensionality (size=dim) is inferred from the first embedding vector. This separation of collections ensures optimized retrieval per modality while allowing for hybrid or late-fusion retrieval strategies later.

  • Inserting points into Qdrant collections:

    client.upload_points(

    TEXT_COLLECTION,

    [models.PointStruct(id=i, vector=text_vecs[i], payload={"source": d.page_content}) for i, d in enumerate(text_docs)],

    )

  • Textual embeddings are stored with:
    • id: An integer index
    • vector: The normalized embedding
    • payload: Metadata (the original chunk text)

      client.upload_points(

      IMAGE_COLLECTION,

      [models.PointStruct(id=i, vector=image_vecs[i], payload={"image": Path(d.page_content).name}) for i, d in enumerate(image_docs)],

      )

    Image embeddings are similarly uploaded, with the image filename stored as metadata. This metadata is critical for subsequent retrieval operations where query results must be rendered back to the user or used for further reasoning.

  • Return values:

return client, embedder

The function returns both:

  • The client, enabling downstream retrieval operations
  • The embedder, allowing query inputs to be transformed into vectors in the same latent space

Process to run the entire code

Ensure the following before executing:

  • The rag/ module (with index_builder.py, embedding_utils.py, and loaders.py) exists and is importable.
  • The directories data/documents/ and data/images/ contain valid files.
  • All required libraries (e.g., qdrant-client, LangChain, transformers) are installed.

This script is typically run once, either during setup or re-indexing of your corpus.

The following code snippet is for executing the entire embedding and indexing pipeline. It begins by importing the build_vectorstores function from the rag.index_builder module, which is responsible for loading documents and images, generating their embeddings, and storing the resulting vectors in a Qdrant vector database. This function encapsulates all the key components required for preparing a multimodal vector store.

from rag.index_builder import build_vectorstores

build_vectorstores()

print("Embedding complete. Vector stores loaded into Qdrant.")

When build_vectorstores() is called, the system performs several tasks: it reads textual data from data/documents and images from data/images, uses a shared multimodal embedder (such as CLIP) to generate normalized vector embeddings for both modalities, and initializes the Qdrant database if it has not already been created. It also creates separate collections for text and image vectors (if they do not exist) and uploads the data along with associated metadata for future retrieval.

Finally, the print() statement confirms the successful execution of the indexing process. This script is typically executed once during system setup or any time the document/image corpus is updated. Before running the script, ensure that all dependencies are installed and that the required directory structure is in place, along with valid content files.

To do for the readers

While the current implementation establishes a robust multimodal retrieval pipeline, it is important to recognize that it does not yet support generative outputs. The system allows for efficient retrieval of semantically relevant text or images based on a user’s query, but stops short of performing natural language generation (NLG). This design reflects a classical retrieval-only architecture. To evolve this into a full-fledged RAG system, readers are encouraged to extend the pipeline with generative capabilities.

To begin, readers should integrate an LLM—such as GPT, Llama, or Mistral—that can synthesize coherent responses using both the query and retrieved content. This requires constructing a wrapper that couples the retriever with a generation module. Libraries such as LangChain or LlamaIndex offer high-level abstractions like RetrievalQA or RAG chains, which streamline this process. These frameworks allow retrieved documents to be passed directly as context into the LLM, enabling output generation in the form of answers, summaries, or semantic interpretations.
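As an illustration of how such a wrapper couples retrieval with generation, here is a dependency-free sketch; retrieve() and llm() are toy stand-ins for a real retriever and a real LLM call, not part of the chapter's codebase:

```python
# Toy retrieval-generation wrapper: retrieve top-k documents, stuff them into
# the prompt as context, and pass the prompt to a (stubbed) LLM.
def retrieve(query, corpus, k=2):
    # Toy retriever: rank documents by word overlap with the query.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:k]

def llm(prompt):
    # Stand-in for a real LLM call (e.g., via LangChain's RetrievalQA).
    return f"[answer based on {prompt.count('CONTEXT:')} context block(s)]"

def rag_answer(query, corpus):
    context = "\n".join(f"CONTEXT: {d}" for d in retrieve(query, corpus))
    return llm(f"{context}\nQUESTION: {query}")

corpus = [
    "vector databases store embeddings",
    "CLIP embeds text and images",
    "cats sleep a lot",
]
print(rag_answer("how are embeddings stored", corpus))
```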

For multimodal scenarios where image embeddings are part of the retrieval results, an additional step may be required. Since most LLMs operate on text, readers should either preprocess retrieved images into captions using image-to-text models or employ multimodal LLMs (e.g., GPT-4V, LLaVA, or Kosmos-2) that natively support image inputs. This enhancement will allow the system to generate contextualized descriptions or insights that span both visual and textual domains.

In summary, readers seeking to extend this project should focus on:

  • Integrating a suitable LLM for generative reasoning.
  • Wrapping the retriever and LLM into a retrieval-generation pipeline.
  • Optionally implementing image-to-text preprocessing for non-textual content.
  • Designing prompt templates that instruct the LLM to leverage retrieved documents effectively.

This generative extension not only elevates the system from semantic matching to intelligent reasoning but also aligns it with the current state-of-the-art in multimodal question answering and document understanding.

Conclusion

This chapter guided the reader through the design and implementation of a multimodal retrieval system that integrates text and image inputs using vector embeddings. It demonstrated how to preprocess documents and images, embed them with bi-encoders like CLIP, and store them in a Qdrant vector database for efficient semantic search. The system supports cross-modal querying (text-to-image, image-to-text) and establishes a solid foundation for real-world applications. While the current setup enables retrieval, a future extension involves integrating LLMs for RAG, allowing the system to generate coherent, context-aware outputs across modalities.

In the next chapter, we will implement the missing generative component by building a complete multimodal retrieval and generation system.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 8

Building a Multimodal RAG System

Introduction

In this chapter, we address the one remaining piece, the generative component from Chapter 7, Building a Bidirectional Multimodal Retrieval Systems, by extending our multimodal retrieval pipeline into a full retrieval-augmented generation (RAG) system. Up to now, we have focused on indexing documents and images, embedding them into a shared vector space, and retrieving relevant text or visuals based on user queries. Here, we will integrate a large language model (LLM) to synthesize coherent, context-aware responses using those retrieved items. We will demonstrate how to wrap the retriever in a generation chain, craft prompt templates that blend the user’s query with retrieved context, and handle both text-to-image and image-to-text workflows. By the end of this chapter, you will have a complete, end-to-end multimodal system capable not only of finding relevant content but also of generating insightful answers and summaries.

Structure

In this chapter, we will learn about the following topics:

  • Implementation of generation
  • Multimodal LLM-based recommender system
  • Incorporate grading with OpenAI
  • LLM-as-a-judge
  • To do

Objectives

This chapter provides a comprehensive overview of advanced evaluation and recommendation strategies in LLM systems. It begins by examining generation techniques, highlighting how LLMs produce context-aware outputs that drive downstream tasks. Building on this, it introduces multimodal recommendation methods that integrate text, image, and other data modalities to improve personalization and user engagement. To ensure the quality and relevance of these generated and recommended outputs, the chapter explores grading mechanisms, automated assessment techniques powered by LLMs that evaluate retrieval accuracy, coherence, and factuality. These grading strategies form the basis for the emerging paradigm of LLM-as-judge, where the LLM is tasked not only with generating responses but also with ranking and validating them. This interconnected view underscores how generation, recommendation, and grading work in concert to support scalable, trustworthy AI systems.

Implementation of generation

Building upon the foundational concepts introduced in Chapter 7, Building Bidirectional Multimodal Retrieval Systems, this section offers an implementation of the generative component by extending our multimodal retrieval pipeline into a full RAG system.

In the preceding chapter, we implemented Figure 8.1, a multimodal retrieval system architecture, where retrieval operates seamlessly across text and image modalities. Academically, this approach leverages embeddings, a vector-based representation capturing semantic relationships within data, generated separately for textual and visual content.

The process initiates with user queries, which may consist of text, images, or both. These inputs are passed to specialized embedding models: a text embedding model for textual queries or documents, and an image embedding model for visual inputs. The documents undergo chunking into smaller units to improve the granularity and efficiency of retrieval, whereas images are directly embedded into the vector representation.

Once embeddings are computed, they are stored in a multimodal vector database designed to handle mixed data types. Upon receiving a query, the system performs vector similarity searches across this database, retrieving results based on semantic proximity rather than exact matches. The returned results, combining textual chunks and images, are then provided back to the user.

A flowchart shows how text and image embedding models convert documents and images into vector embeddings, which are stored in a vector database and queried for multimodal search results.
Figure 8.1: Multimodal retrieval system

In this chapter, we will implement the generation part of the generative AI (GenAI) system as shown in Figure 8.2, specifically the portion of the circle. This multimodal RAG pipeline in Figure 8.2 enables seamless integration of text and image data into a unified semantic search and generation system. The architecture emphasizes modularity and extensibility, making it suitable for a wide range of applications in knowledge retrieval, visual question answering (VQA), and AI-powered document understanding. By utilizing a vector database capable of storing and searching across multiple modalities, the system facilitates richer interaction and more accurate responses, grounded in both textual and visual evidence.

The flowchart shows how user queries are processed by text and image embedding models and stored in a vector database; retrieved chunks are sent to the LLM, which generates the result returned to the user.
Figure 8.2: Multimodal RAG system

Architectural components and workflow

The presented system in Figure 8.2 outlines a multimodal information retrieval and generation framework that integrates both textual and visual data to support enhanced user interaction. This architecture leverages modality-specific embedding models and a unified vector database to retrieve relevant information, subsequently synthesized into a coherent response by LLM. The design is optimized for applications requiring cross-modal reasoning, such as VQA, document search with image augmentation, or interactive multimodal assistants.

The following section presents a comprehensive end-to-end pipeline for building a multimodal RAG system that seamlessly integrates textual and visual data. Users can submit either text or image queries, which are routed through specialized embedding models to produce unified vector representations. These embeddings are stored in a shared vector database, enabling cross-modal similarity search across both document chunks and image content. Upon retrieval, the top-k relevant results ground the response generation process, powered by an LLM. The final output is a context-rich natural language response reflecting both the query intent and embedded knowledge. The following list outlines all the required modules:

  • User query interface:
    • Users submit a query, which may be textual or multimodal in nature.
    • This input is routed to the appropriate embedding model depending on the query type.
  • Embedding models:
    • The system employs two distinct embedding pipelines:
      • A text embedding model, which transforms text-based content (e.g., documents) and text queries into vector representations.
      • An image embedding model, which encodes visual content (e.g., images or screenshots) into a comparable vector space.
    • These embedding models ensure that both documents and images are projected into a unified latent space, enabling cross-modal similarity search.
  • Document and image ingestion:
    • Documents are first chunked into smaller segments for fine-grained embedding and retrieval.
    • Images are encoded directly without chunking.
    • Both types of content are embedded and stored in a shared vector database capable of handling multimodal vector representations.
  • Vector database with multimodal embeddings:
    • This database maintains the indexed vector representations of both text and image modalities.
    • When a query is embedded, a similarity search retrieves the most relevant entries (text chunks and/or images) from the vector store.
  • Vector search results:
    • The retrieved results (top-k similar vectors) form the contextual grounding for subsequent generation.
    • This step ensures that only the most semantically relevant documents or images contribute to the final output synthesis.
  • LLM:
    • An LLM consumes the retrieved context and the original query to generate a comprehensive and context-aware natural language response.
    • The LLM operates on text, but benefits from context that may have originated from either text or image embeddings.
  • Output delivery:
    • The final response, synthesized by the LLM, is returned to the user.
    • The user receives a context-enriched result that reflects both the query intent and the latent knowledge embedded in the database.
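Conceptually, the similarity-search step in this pipeline reduces to a dot product over normalized embeddings; a minimal NumPy sketch of top-k retrieval, with toy 2-D vectors in place of real embeddings, is:

```python
# Sketch of the top-k similarity search step: with unit-length embeddings,
# cosine similarity is just a dot product, and top-k retrieval is an argsort.
import numpy as np

def top_k(query_vec, index_vecs, k=2):
    # index_vecs: (n, d) matrix of normalized embeddings; query_vec: (d,)
    sims = index_vecs @ query_vec          # cosine similarities
    order = np.argsort(-sims)[:k]          # indices of the k best matches
    return [(int(i), float(sims[i])) for i in order]

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.8, 0.6])
print(top_k(query, docs))  # the mixed vector [0.6, 0.8] scores highest
```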

The code of this chapter contains every Python module you need to ingest data, build your indexes, run retrieval, and generate responses. For a detailed understanding of the code blocks, please refer to Chapter 7, Building a Bidirectional Multimodal Retrieval Systems, section Code implementation and explanation. A quick checklist is as follows:

  • loaders.py:
    • load_pdfs_and_texts() and load_images() to read your raw files
  • embedding_utils.py:
    • get_mm_embedder() wrapping the CLIP-based Hugging Face embedder
  • retriever.py:
    • build_vectorstores() to load, embed, normalize, and upload vectors to Qdrant
    • retrieve_by_text()/retrieve_by_image() for semantic lookup
  • run_once.py:
    • One-off script to populate your Qdrant collections before running the app
  • app.py:
    • Streamlit UI tying everything together

With the retrieval pipeline in place, this section now shifts focus to the generator, which plays a pivotal role in transforming retrieved context into natural language responses. While the rest of the code remains consistent with the earlier setup, the emphasis here is on the generator.py module:

  • generator.py:
    • init_generator() and generate_response() to wrap an LLM (e.g. GPT-3.5)

This component offers a clean and modular interface for text generation within the RAG workflow, clearly separating model initialization (init_generator) from the actual generation logic (generate_response). This design promotes reusability, simplifies integration, and aligns well with best practices in prompt engineering and LLM abstraction.

Generator

The rest of the code remains the same; however, the main focus is on the generator part. The generator.py module provides a clean and modular interface for text generation in a RAG setup. It separates the model initialization (init_generator) from the generation process (generate_response), promoting reusability and clarity in design. The architecture aligns with best practices in prompt engineering and model abstraction.

By abstracting the generation logic from retrieval mechanics, the module remains flexible for use across multiple modalities, provided the context can be represented textually. This decoupling is particularly important in multi-agent or multimodal systems where the same generation module can be reused across varied sources of input.

  • Function 1: init_generator()
    • Purpose: This function initializes and returns a LangChain LLMChain object for generating responses using an LLM hosted locally via Ollama.
    • Details:
      • ChatOllama is a LangChain wrapper that enables communication with an Ollama-hosted chat model, such as Llama 3, Mistral, or Llama 2. The temperature parameter is set to 0.5, balancing deterministic and creative outputs.
      • PromptTemplate defines a structured prompt that the language model uses to generate its output. This template expects two inputs: query and context.
      • LLMChain integrates the prompt with the selected model. It forms an executable unit that can accept inputs and produce corresponding outputs from the model.
    • Prompt structure:

      You are an assistant. Based on the following query and context, provide a relevant and coherent answer.

      Query: {query}

      Context:

      {context}

      Answer:

      This template is designed to guide the model's behavior by clearly defining its role (assistant) and the input fields it should consider before generating an answer. The use of distinct sections for Query and Context ensures structured input formatting, improving grounding and coherence in the model’s responses.

  • Function 2: generate_response(llm_chain, query: str, retrieved: list)
    • Purpose: This function uses the pre-initialized language model chain to generate a response by combining a user’s query with context derived from a retrieval component.
    • Details:
      • The retrieved list, which contains context items (such as similar documents or image captions), is concatenated into a single string using newline characters. This aggregated string serves as the contextual input to the model.
      • The llm_chain.run() method is called with a dictionary containing the query and context. LangChain renders the prompt using this input and sends it to the underlying language model for completion.
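A pure-Python sketch of this flow, using a stub chain in place of the real LangChain LLMChain so the prompt rendering can be inspected without a model (EchoChain is illustrative, not part of the chapter's codebase):

```python
# Stand-in for generate_response(): retrieved snippets are joined with
# newlines, rendered into the prompt template, and handed to the chain.
PROMPT = (
    "You are an assistant. Based on the following query and context, "
    "provide a relevant and coherent answer.\n"
    "Query: {query}\n"
    "Context:\n{context}\n"
    "Answer:"
)

class EchoChain:
    """Stub chain: renders the prompt instead of calling a real LLM."""
    def run(self, inputs):
        return PROMPT.format(**inputs)

def generate_response(llm_chain, query, retrieved):
    context = "\n".join(retrieved)  # aggregate retrieved items into one string
    return llm_chain.run({"query": query, "context": context})

out = generate_response(EchoChain(), "What is RAG?", ["chunk A", "chunk B"])
print(out)
```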

The end-to-end code can be found in Chapter 8, Building a Multimodal RAG System, section: multimodal_rag_system.py.

Building on the foundations of multimodal RAG, where LLMs leverage diverse modalities such as text, images, and structured data to enhance information access and synthesis, we now transition to a related yet distinct application: multimodal recommendation systems. While RAG focuses on retrieving and generating contextually rich responses, multimodal recommendation systems use similar cross-modal understanding to predict and suggest relevant content tailored to user preferences. This chapter explores how the same capabilities that empower RAG (embedding alignment, multimodal fusion, and semantic understanding) are adapted to deliver highly personalized, diverse, and context-aware recommendations across industries and platforms.

Multimodal LLM-based recommender system

On an OTT platform, a multimodal LLM (MLLM) can revolutionize content recommendation by integrating textual descriptions, promotional images, video thumbnails, user reviews, and viewing history. For instance, if a new user watches trailers with dark cinematography and reads thriller plotlines, the model can infer nuanced genre preferences, such as noir thrillers with psychological elements, despite limited watch history. This enables effective recommendations even in cold-start scenarios, where traditional systems relying on metadata or user similarity may falter. By aligning multimodal signals, LLMs enhance both discovery and engagement, tailoring suggestions to the user’s implicit and explicit tastes.

An MLLM can function as a powerful recommendation engine by leveraging its ability to understand and integrate diverse data types—text, images, audio, and even video. Traditional recommendation systems often rely solely on collaborative filtering or structured metadata, which can struggle in cold-start scenarios or fail to capture nuanced user preferences. In contrast, MLLMs extract rich, high-dimensional embeddings from varied content sources and user interactions, enabling more personalized and context-aware recommendations.

For instance, models like CLIP or GPT-4V can understand both product descriptions and visual aesthetics, making them ideal for recommending fashion, home decor, or multimedia content. LLMs can summarize user histories, infer intent from queries, and match them with relevant items across modalities. They also enable explainability, like generating natural language justifications for recommendations, which enhances trust and user satisfaction.

Advanced systems like MLLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation (Molar), LLM-Based Multimodal Recommendation with User History Encoding and Compression (HistLLM), and serendipitous MLLM have already demonstrated real-world impact, outperforming conventional approaches in personalization, novelty, and engagement metrics. With hierarchical planning and compressed user histories, these models support scalable and diverse recommendations in real-time. As LLMs continue to evolve, they are poised to become foundational in building next-generation, multimodal recommendation engines across industries.

Emerging architectures such as LLMs with Graph Augmentation for Recommendation (LLMRec) expand this paradigm further by embedding LLM-driven reasoning directly into interaction graphs. These systems do not just interpret content, but rather, they actively augment recommendation graphs with inferred relationships, enriched item metadata, and user intent profiles generated by LLMs. By combining LLM capabilities with the structural power of graph-based models, LLMs for ranking-based recommendation (LlamaRec) enhance both semantic depth and recommendation accuracy, particularly in sparse data scenarios.

Leading architectures and examples

This section explores leading architectures and recent innovations in multimodal recommendation systems powered by LLMs. It covers models like Multimodal Recommender System (MMRec), Molar, HistLLM, LLMRec, and others that integrate text, image, and behavioral signals to deliver personalized, context-aware, and explainable recommendations. Key design strategies such as multimodal embedding fusion, graph augmentation, history compression, and serendipitous discovery are discussed alongside supporting tools like Ducho 2.0 and ATFLRec (A Multimodal Recommender System with Audio-Text Fusion and Low-Rank Adaptation via Instruction-Tuned Large Language Model). The section also outlines a practical implementation roadmap and highlights the advantages of these systems in handling cold-starts, enhancing diversity, and improving engagement. Details are as follows:

  • MMREC: It extracts embeddings from text and images, unifies them in a shared latent space, and passes them to a deep ranking model. This enables precise, content-aware recommendations and better false-positive control.
  • Molar: It is designed for sequential recommendation tasks. This model aligns multimodal item embeddings with collaborative filtering signals to personalize content based on evolving user behavior.
  • HistLLM: It compresses a user's full multimodal interaction history into a single prompt token, enabling faster and more efficient inference with LLMs without losing contextual fidelity.
  • Serendipitous MLLM: It blends high-level intent detection with hierarchical planning. It delivers recommendations that are novel yet relevant, increasing discovery and user satisfaction.
  • LLMRec: Augments traditional user-item graphs using LLMs to infer new user interactions, generate rich item attributes, and create textual user profiles. It then applies noise-filtering and feature enhancement techniques to stabilize training and improve robustness across sparse and noisy environments. LLMRec is model-agnostic, meaning it can be integrated with existing graph neural networks (GNNs) or Matrix Factorization (MF) pipelines.

The following table provides a comparative overview of prominent multimodal recommendation systems, summarizing their core strategies, modalities handled, innovations, cold-start capabilities, system compatibility, and the tools or frameworks used. This comparison highlights the diverse approaches through which LLMs are being integrated with multimodal signals to deliver scalable, personalized, and intelligent recommendation experiences.

Model name | Core strategy | Modality fusion | Key innovation/strength | Cold-start handling | Compatibility | Tools/frameworks used
MMREC | Multimodal embedding + deep ranking model | Text + image | Combines modalities in a shared latent space; strong false-positive control | Moderate | Deep ranking pipelines | PyTorch, Transformers, ResNet-50
Molar | Collaborative filtering alignment with multimodal input | Text + image + behavior | Aligns item embeddings with user behavior in sequences | High | Sequential recommender systems | PyTorch, Hugging Face, self-attention-based sequential model (SASRec)
HistLLM | History compression using LLM prompt token | All user interactions | Encodes full user history into a single token for fast inference | High | LLM-based inference | OpenAI API, LangChain, Faiss
Serendipitous LLM | Intent modeling with hierarchical planning | Text + contextual features | Promotes novelty while preserving relevance | High | Personalized exploration | Llama, prompt injection, PlannerX
LLMRec | LLM-driven user graph augmentation + noise-filtering | Text + graph + attributes | Enhances robustness in sparse environments; model-agnostic | Very high | GNNs, MF hybrid systems | Neo4j, GraphSAGE, OpenAI, DGL

Table 8.1: Model comparison overview

MLLM-based recommendation engines represent the next evolution in personalized content delivery. By leveraging the combined strengths of deep multimodal perception and natural language reasoning, these systems offer superior relevance, contextual understanding, and user satisfaction. They are especially useful in handling cold-start scenarios, generating diverse suggestions, and enhancing user engagement through explainable and intuitive recommendations.

Having explored the capabilities and design principles of multimodal recommendation systems, it becomes evident that delivering high-quality suggestions is only one part of the equation. Equally important is the ability to assess, rank, and validate these recommendations in a structured and reliable manner. This brings us to the next critical aspect of intelligent systems: grading. In the following chapter, we shift our focus from generation to evaluation, examining how grading mechanisms, both rule-based and model-driven, can be applied to score responses, rank recommendations, and ensure system outputs meet user expectations and domain-specific standards.

Incorporate grading with OpenAI

As discussed in Chapter 6, Two and Multi-stage GenAI Systems, grading plays a critical role in validating and optimizing the output quality of multimodal RAG systems. Without a robust grading mechanism, several issues can compromise system reliability and user trust. First, the absence of quality control may lead to the generation of irrelevant, incoherent, or hallucinated responses, especially when combining diverse modalities like text, images, and video. This degrades user experience and undermines the credibility of recommendations or answers. Second, systems without grading cannot self-assess or improve over time, leading to stagnant or even deteriorating performance as the knowledge base evolves. In safety-critical domains such as healthcare, education, or finance, ungraded outputs can cause misinformation or biased recommendations with serious consequences. Third, the lack of a feedback loop hinders fine-tuning and model alignment efforts, preventing adaptive personalization or performance optimization. Furthermore, the inability to rank candidate outputs weakens multi-candidate selection strategies that could otherwise promote diversity and novelty. Finally, in multi-agent or hybrid RAG setups, where outputs from different retrieval or reasoning modules need to be evaluated for consensus, grading becomes essential for orchestrated decision-making. In summary, grading is not just a post-processing step. It is foundational to ensuring accuracy, trustworthiness, and adaptability in multimodal RAG systems. As shown in Figure 8.3, the grading process is situated downstream of the core retrieval and embedding operations and serves as an intelligent evaluation mechanism for both retrieval relevance and generative response quality:

[Figure: flowchart of a query processed by text and image embedding models, stored in a vector database and retrieved for the LLM, then scored by two graders for retrieval relevance and response quality.]

Figure 8.3: Pipeline including grading using a separate LLM

The pipeline begins with a user query, which is simultaneously processed through text and image embedding models. These models generate vector representations of the input, which are then used to query a vector database containing multimodal embeddings derived from both documents and images. Before storage, the documents are segmented into chunks and embedded alongside any associated images to support fine-grained semantic retrieval.
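The chunking step described above can be sketched with a simple sliding window; the real pipeline may use any splitter, so this fixed-size chunker is a simplified stand-in:

```python
def chunk_text(text, size=40, overlap=10):
    # Slide a fixed-size window across the text with overlap, so material
    # near a chunk boundary appears in two chunks and stays retrievable.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks

doc = "Multimodal retrieval stores text chunks and image embeddings side by side."
pieces = chunk_text(doc, size=30, overlap=5)
print(len(pieces), repr(pieces[0]))
```

Each chunk (and each associated image) would then be embedded and upserted into the vector database; the overlap is what preserves fine-grained semantics at chunk boundaries.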

Once the vector database returns a ranked list of relevant results, the grading component is invoked. This component is powered by an LLM operating in a dual role:

  • Retrieval relevance grader: This module assesses the semantic alignment between the user's query and the retrieved content. It assigns a relevance score to each retrieved item based on contextual fidelity, factual alignment, and task-specific criteria.
  • Generative response grader: This module evaluates the quality of responses generated based on the retrieved content. It considers factors such as fluency, factual accuracy, informativeness, and user intent alignment.

Together, these grading modules act as a feedback loop, which not only determines which results are presented to the user but also enables fine-tuning of retrieval and generation mechanisms. By leveraging the LLM as a grader, the system ensures that output quality is continually assessed through advanced language understanding capabilities rather than relying on static heuristic rules.

This framework elevates the utility of multimodal RAG systems by integrating intelligent, automated grading, ensuring that users receive the most relevant, high-quality results in both retrieval-based and generative interactions.

The following section provides a breakdown of the components with explanations and embedded code.

Import statements

The script uses LangChain’s components for LLMs, prompt templates, and chain orchestration:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

These libraries allow integration with OpenAI's GPT models and enable dynamic construction of LLM-based workflows.

Generative response grader

This component evaluates how well a generated response answers the user’s query based on the retrieved context. It uses a language model to assign a score from one to five, along with a justification, enabling precise assessment of response quality, coherence, and alignment with the original user intent. Details are as follows:

  • Purpose: Evaluates how accurately and effectively the generated response answers the query using the given context.
  • Initialization function:

    def init_grader():
        llm = ChatOpenAI(temperature=0.3, model="gpt-3.5-turbo")
        prompt = PromptTemplate(
            input_variables=["query", "context", "response"],
            template="""Evaluate the quality of the following generated response.

    Query: {query}
    Context: {context}
    Response: {response}

    Give a score from 1 to 5 and explain why.
    Score and Justification:"""
        )
        return LLMChain(llm=llm, prompt=prompt)

    • A ChatOpenAI model is initialized with a low temperature=0.3 to ensure relatively deterministic and concise scoring.
    • A PromptTemplate defines the input format and instructs the LLM to evaluate a generated response on a scale of one to five, along with a justification.
    • The chain returned by LLMChain links the prompt and the LLM for later invocation.
  • Execution function:

    def grade_response(grader_chain, query: str, context: str, response: str):
        return grader_chain.run({
            "query": query,
            "context": context,
            "response": response
        })

    • This function takes the initialized grader_chain and passes a specific query, context, and response to it.
    • It returns a scored evaluation with justification.
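The grader chain returns free text such as "Score: 4. …". A small parser can pull the numeric score out for downstream use; the format assumed here is illustrative, since an LLM's phrasing varies (in production, constraining the model to structured output is safer):

```python
import re

def parse_score(grader_output: str) -> int:
    # Extract the first standalone digit in the 1-5 range; 0 signals "no score found".
    m = re.search(r"\b([1-5])\b", grader_output)
    return int(m.group(1)) if m else 0

print(parse_score("Score: 4. The response is accurate but omits context."))  # → 4
print(parse_score("No numeric score given."))  # → 0
```

A parsed integer score makes it straightforward to threshold responses (e.g., regenerate anything below 3) or aggregate grades across a batch.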

Retrieval relevance grader

The following list outlines the purposes, initialization, and execution functions:

  • Purpose: Assesses whether a retrieved document contains relevant information in relation to the user’s query.
  • Initialization function:

    def init_retrieval_grader():
        llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
        prompt = PromptTemplate(
            input_variables=["question", "document"],
            template="""You are a grader assessing relevance of a retrieved document to a user question.
    If the document contains keyword(s) or semantic meaning related to the question, grade it as relevant.

    Here is the retrieved document:
    {document}

    Here is the user question:
    {question}

    Carefully and objectively assess whether the document contains at least some information that is relevant to the question.
    Return a JSON object with a single key, binary_score, that is either 'yes' or 'no'."""
        )
        return LLMChain(llm=llm, prompt=prompt)

    • The temperature is set to 0, ensuring consistent, reproducible binary output (yes or no).
    • The prompt is more rule-based and structured for objective grading.
    • It expects a JSON output for binary classification.
  • Execution function:

    def grade_document_relevance(grader_chain, question: str, document: str):
        return grader_chain.run({
            "question": question,
            "document": document
        })

    • This function executes the grading chain, determining if the document is semantically or topically relevant to the user’s question.
    • This module enables plug-and-play evaluation components for any system that uses LLMs for retrieval and generation. It helps ensure:
  • Quality control of generated responses, rated on a one to five scale with justification.
  • Precision in retrieval, via binary scoring of document relevance.
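Because the relevance grader is instructed to return a JSON object, its output can be parsed defensively. The fallback regex below assumes the model may wrap the JSON in extra prose, which is a common failure mode rather than documented behavior:

```python
import json
import re

def parse_binary_score(raw: str) -> bool:
    # Try strict JSON first; fall back to scanning for the key in loose text.
    try:
        return json.loads(raw).get("binary_score") == "yes"
    except (json.JSONDecodeError, AttributeError):
        m = re.search(r'"binary_score"\s*:\s*"(yes|no)"', raw)
        return bool(m) and m.group(1) == "yes"

print(parse_binary_score('{"binary_score": "yes"}'))        # → True
print(parse_binary_score('Sure! {"binary_score": "no"}'))   # → False
```

The boolean result can then gate the pipeline directly, e.g., dropping irrelevant documents before they reach the generation step.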

These graders are crucial for developing intelligent and self-evaluating RAG systems where feedback loops help improve reliability, explainability, and user satisfaction.

Grading and generation models

Grading and generation are fundamentally distinct tasks in natural language systems and therefore require specialized models or chains to achieve optimal performance. Generation involves creating fluent, contextually relevant, and user-aligned responses based on input prompts or retrieved context. It prioritizes creativity, coherence, and intent satisfaction. In contrast, grading is an evaluative task that demands objectivity, consistency, and critical reasoning to assess the quality, correctness, or relevance of a response or retrieved content. Using the same model for both tasks can introduce conflicts: a generative model may exhibit confirmation bias by favoring its own outputs when acting as a grader, thus undermining fairness in evaluation. Additionally, prompts optimized for generation typically encourage verbosity and hypothesis formation, whereas grading prompts require precision, brevity, and analytical rigor. From a systems design perspective, separating these two roles allows task-specific prompt engineering, temperature settings, and scoring criteria. This modularity enhances explainability, enables benchmarking of generation performance, and allows independent updates or model selection per task. Consequently, as shown in Figure 8.4, using distinct LLMs or chains for grading and generation aligns with best practices in responsible AI system design and ensures more robust, transparent, and accountable recommendations in RAG workflows.
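This role separation can be made explicit in configuration, with each role getting its own model and sampling settings. A minimal sketch (the temperature values are illustrative defaults, not prescriptions):

```python
ROLE_CONFIG = {
    # Creative, user-facing generation tolerates sampling variance.
    "generator": {"model": "gpt-3.5-turbo", "temperature": 0.7},
    # Evaluation wants determinism and terse, rubric-bound output.
    "grader": {"model": "gpt-3.5-turbo", "temperature": 0.0},
}

def make_llm_kwargs(role: str) -> dict:
    # Look up per-role settings; unknown roles raise a KeyError loudly.
    cfg = ROLE_CONFIG[role]
    return {"model": cfg["model"], "temperature": cfg["temperature"]}

print(make_llm_kwargs("grader"))
```

Keeping the two roles in separate config entries also means the grader model can later be swapped (e.g., for a stronger cloud model) without touching generation code.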

[Figure: flowchart of a local LLM generating output that is sent to a cloud LLM acting as grader, which serves two roles: retrieval relevance grader and generative response grader.]

Figure 8.4: Grading with separate LLMs

Cloud LLMs for grading

Grading using cloud-based LLMs offers significant advantages over local deployments, especially in the context of reliability, scalability, and performance. Cloud LLMs, such as OpenAI’s GPT-3.5 or GPT-4, benefit from continuous fine-tuning, access to extensive training data, and infrastructure optimizations that are difficult to replicate on-premise. These models are regularly updated to align with the latest linguistic trends, reasoning improvements, and safety filters, resulting in more consistent and accurate evaluations of query-response quality or document relevance. Furthermore, cloud LLMs are typically deployed on high-performance hardware that allows for rapid inference at scale, which is essential for real-time or large-batch grading tasks in production environments. In contrast, local LLMs are often constrained by limited GPU resources and outdated weights, which can degrade grading fidelity. Additionally, implementing version control, bias mitigation, and prompt safety measures on local models requires significant engineering effort. For academic and enterprise systems where robustness and accuracy of evaluation are critical, leveraging cloud-based LLMs as graders ensures higher trustworthiness, up-to-date linguistic knowledge, and greater standardization, making them a superior choice despite cost considerations.

You can find the code in the GitHub repository for Chapter 8, Building a Multimodal RAG System, under Chapter_8_multimodal_rag_system_Grader.py, including a variant that performs grading with a local LLM.

By examining Figure 8.4, you may begin to recognize an emerging concept known as LLM-as-a-judge; if so, you have already grasped its core idea.

LLM-as-a-judge

LLM-as-a-judge refers to the use of an LLM to evaluate, grade, or rank the outputs of other AI systems, especially in tasks like generation, retrieval, summarization, or reasoning. Instead of using hard-coded rules or human raters, the LLM is prompted to act as an intelligent evaluator.

Figure 8.5 illustrates a best practice architectural pattern where grading and generation are handled by separate LLMs, each optimized for a distinct purpose. A local LLM is used for content generation tasks, ensuring fast, cost-efficient, and offline operation. In parallel, a cloud-hosted LLM acts as an impartial judge, responsible for evaluating both the retrieval relevance and response quality. This separation of roles enables more objective assessment, improves feedback loop integrity, and avoids bias from self-evaluation. The use of cloud LLMs for judgment ensures consistent, high-quality grading aligned with broader semantic understanding, especially for complex or nuanced evaluations required in downstream tasks.

[Figure: flowchart of a local LLM acting as generator, sending output to a cloud LLM acting as judge, which performs retrieval relevance grading, generative response grading, or other tasks.]

Figure 8.5: LLM-as-a-judge

Rationale and functionality

LLM-as-a-judge operates, as shown in Figure 8.5, by prompting a capable LLM (e.g., GPT-4 or GPT-3.5 Turbo) with an explicit rubric, such as relevance, accuracy, clarity, or consistency, and asking it to evaluate or compare outputs based on these criteria. Three common approaches include:

  • Single-output scoring: The LLM assigns a numeric score (e.g., 1-5) by assessing a single response against rubric criteria, with or without a reference answer.
  • Pairwise comparison: Given two or more candidate outputs to the same query, the LLM selects the better one. Studies using benchmarks like MT-Bench and Chatbot Arena have demonstrated over 80% agreement between LLM judgments and human evaluations.
  • Reference-based scoring: The LLM compares generated output to a reference (or retrieved context), increasing score consistency and alignment with human preferences.
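A pairwise judge prompt is plain string templating. The rubric wording below is an illustrative sketch, not a benchmarked prompt:

```python
def pairwise_judge_prompt(query: str, answer_a: str, answer_b: str) -> str:
    # Assemble a rubric-driven comparison prompt for an LLM judge.
    return (
        "You are an impartial judge. Compare two answers to the same query.\n"
        f"Query: {query}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Judge on relevance, accuracy, and clarity. "
        "Reply with exactly 'A' or 'B' for the better answer."
    )

p = pairwise_judge_prompt(
    "What is RAG?",
    "Retrieval-augmented generation combines a retriever with a generator.",
    "A type of cloth.",
)
print(p)
```

In practice, candidate order should be randomized (and judged in both orders) to mitigate the position bias LLM judges are known to exhibit.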

The following describes how it is applied to our system.

In our system:

  • The response grader evaluates the quality of an LLM-generated answer.
  • The retrieval relevance grader judges whether retrieved documents (or captions for images) are relevant to a query.

Both are classic examples of LLMs acting as evaluators, making subjective or semantic judgments using natural language prompts and structured outputs.

To do

The current implementation of the retrieval relevance grader in grader.py is limited to evaluating textual content. Specifically, the prompt expects a document, assumed to be a text chunk, and a question, then determines whether the document is relevant to the query based on semantic or keyword overlap. This approach is effective for evaluating text retrieved from a corpus, but does not apply to visual content such as images.

To extend the grading system to support image relevance evaluation, the reader should consider implementing one of the following enhancements:

  • Image captioning as a preprocessing step: Incorporate an image captioning model (e.g., BLIP, ViT-GPT2, or any Hugging Face-based captioning pipeline) to automatically generate a textual description of each retrieved image. This caption can then be passed as the document variable to the existing relevance grading prompt. This approach enables consistent evaluation logic while remaining within the constraints of text-only LLMs like Llama or GPT-3.5.
  • Multimodal language models for direct image input: Alternatively, leverage an MLLM such as GPT-4V, LLaVA, or MiniGPT-4 that accepts both images and text as input. These models can directly evaluate the relevance of an image in the context of a query without requiring intermediate captioning. This approach is more powerful but requires appropriate infrastructure and runtime support for multimodal input.
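The first enhancement can be wired up generically: any captioning function feeds the existing text-only relevance grader unchanged. Both stubs below are placeholders for a real captioner (e.g., a BLIP pipeline) and the LLM grader discussed earlier:

```python
def grade_image_relevance(image_path: str, question: str, caption_fn, grade_fn):
    # Caption the image, then reuse the text-only relevance grader unchanged.
    caption = caption_fn(image_path)
    return grade_fn(question=question, document=caption)

# Stub components standing in for a real captioner and LLM grader.
stub_caption = lambda path: "A bar chart of quarterly GPU sales."
stub_grader = lambda question, document: (
    "yes" if "GPU" in document and "GPU" in question else "no"
)

print(grade_image_relevance("chart.png", "How did GPU sales trend?",
                            stub_caption, stub_grader))  # → yes
```

Because the grader interface is unchanged, swapping the stubs for a Hugging Face captioning pipeline and the chapter's grade_document_relevance chain requires no changes to this wrapper.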

Implementing either of these enhancements would make the relevance grading system more robust and inclusive of multimodal content, aligning it with the broader goals of end-to-end RAG in real-world multimodal AI systems.

Conclusion

In this chapter, readers explored the integration of core components essential for building intelligent, human-aligned AI systems. Starting with generation, the chapter demonstrated how LLMs can produce contextually relevant responses. This was extended into the realm of multimodal recommendation, where text and visual inputs jointly informed retrieval and personalization. Readers also learned how to incorporate grading mechanisms using OpenAI models, enabling automatic, scalable evaluation of both retrieved content and generated outputs. The chapter culminated with the concept of LLM-as-a-judge, emphasizing the role of LLMs in semantically rich, human-aligned evaluation processes.

Having established a strong foundation, the next chapter will extend this architecture by introducing a reranking layer, a critical enhancement that further refines retrieval quality before generation. Readers will understand how rerankers selectively prioritize top candidates based on semantic relevance, factual grounding, or user preferences. This addition plays a vital role in multimodal RAG pipelines, ensuring that the content fed into the LLM for generation is not only relevant but optimally ranked. Through this, we move closer to designing robust, explainable, and high-utility AI systems capable of dynamic reasoning across modalities.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 9
Building GenAI Systems with Reranking

Introduction

In an increasingly visual and interconnected digital world, the ability to search and retrieve information across different modalities, such as text and images, has become a cornerstone of advanced artificial intelligence (AI) applications. This chapter introduces the concept of multimodal retrieval, where systems are designed to understand and correlate both textual and visual inputs. Unlike traditional search engines that rely solely on textual similarity, multimodal systems use vector representations from both images and text to deliver richer, more contextually aligned results. You will learn how to build such a system by integrating Qdrant as a vector database, Contrastive Language-Image Pretraining (CLIP) models from Hugging Face for generating image embeddings, and LangChain to orchestrate the retrieval process. These tools enable unified access to multiple data formats, allowing users to perform flexible cross-modal searches, such as retrieving descriptions from images or identifying images that match textual inputs.

Throughout the chapter, you will construct dual-index vector stores and develop hybrid retrievers capable of handling diverse query formats. Python-based implementations will guide you through indexing workflows, embedding pipelines, and retrieval logic that switches seamlessly between modalities. Beyond technical architecture, the chapter delves into practical design decisions like similarity scoring, modality prioritization, and custom retrieval logic. By the end, you will have the skills to deploy a production-ready multimodal retriever—a foundation applicable to use cases in e-commerce recommendations, visual content discovery, and semantic search engines. This hands-on approach ensures you not only understand the theory but also gain the ability to implement scalable, real-world solutions.

Structure

In this chapter, we will learn about the following topics:

  • Reranking
  • Reranking in information retrieval and RAG systems
  • Reranking using cross-encoder in multimodal RAG
  • Cross-encoder architecture in multimodal settings
  • Multi-index embedding in RAG systems
  • Code implementation and explanation
  • To do

Objectives

This chapter explores reranking in information retrieval and multimodal retrieval-augmented generation (RAG) systems. It introduces key reranker categories, with a special focus on cross-encoders for refining retrieved results. Readers will understand the architecture of cross-encoders, multi-index embedding in multimodal contexts where both images and text are involved, and how these models enhance semantic precision. A practical code walk-through demonstrates how to implement and integrate a cross-encoder-based reranker in a multimodal RAG pipeline. The chapter concludes with a hands-on to-do exercise, challenging readers to complete the missing components and solidify their understanding through active implementation.

Reranking

Building upon the foundational concepts introduced in Chapter 1, Introducing New Age Generative AI, Chapter 6, Two and Multi-stage GenAI Systems, and Chapter 8, Building a Multimodal RAG System, let us examine Figure 9.1, which illustrates a two-stage RAG architecture that incorporates a cross-encoder reranker for enhanced result precision. The workflow begins with a user query that passes through input processing before proceeding to the retrieval pipeline. In parallel, documents in the corpus are chunked and passed through an embedding model, such as a transformer-based encoder, to generate dense vector representations. These are stored in a vector database.

The flow diagram shows a user submitting a query, which is embedded and compared against a vector database of documents. The top-ranked results are reranked and sent to a large language model (LLM), which generates the result returned to the user.

Figure 9.1: Cross-encoder reranking in RAG

At query time, the user query is encoded into a vector and compared against the stored document embeddings using approximate nearest neighbor (ANN) search, retrieving the top-k most similar candidates. These vector search results are then forwarded to a cross-encoder reranker, which jointly processes the original query and each candidate document to compute fine-grained similarity scores via full token-level interaction. The reranker reorders the results based on semantic relevance, producing a more accurate set of top-k reranked documents.

These reranked documents, along with the original user query, are passed into the large language model (LLM) for synthesis. The LLM generates the final answer, which is returned to the user. This two-stage design balances scalability (via bi-encoder retrieval) with precision (via cross-encoder reranking), resulting in both efficient and high-quality response generation.
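This two-stage flow can be sketched end to end. Note that `bi_encode` and `cross_score` below are deliberately simple stand-ins (a bag-of-words encoder and a token-overlap scorer) for the real transformer models; they exist only to make the control flow of retrieve-then-rerank concrete:

```python
import numpy as np

# Toy corpus; in practice these would be chunked documents.
DOCS = [
    "gpu memory tuning",
    "transformer attention basics",
    "vector database indexing",
    "reranking with cross-encoders",
]
VOCAB = {w: i for i, w in enumerate(sorted({t for d in DOCS for t in d.split()}))}

def bi_encode(text: str) -> np.ndarray:
    """Stand-in bi-encoder: bag-of-words over a fixed vocabulary. A real
    system would use a transformer sentence encoder instead."""
    v = np.zeros(len(VOCAB))
    for tok in text.lower().split():
        if tok in VOCAB:
            v[VOCAB[tok]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def cross_score(query: str, doc: str) -> float:
    """Stand-in cross-encoder: token overlap. A real cross-encoder jointly
    encodes the (query, document) pair and outputs a relevance logit."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve_then_rerank(query: str, k: int = 3, n: int = 1):
    # Stage 1: scalable retrieval, cosine similarity over the whole corpus.
    doc_vecs = np.stack([bi_encode(d) for d in DOCS])
    sims = doc_vecs @ bi_encode(query)
    top_k = [DOCS[i] for i in np.argsort(-sims)[:k]]
    # Stage 2: precise reranking of only the small top-k candidate set.
    reranked = sorted(top_k, key=lambda d: cross_score(query, d), reverse=True)
    return reranked[:n]

print(retrieve_then_rerank("cross-encoders for reranking"))
```

The key design point is that the expensive pair-wise scorer only ever sees k candidates, never the full corpus.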

Reranking in information retrieval and RAG systems

Rerankers are pivotal components in both traditional information retrieval systems and modern RAG pipelines. In general information retrieval, rerankers refine an initial list of candidate documents retrieved by a fast, often approximate method. This second-stage reranking is crucial for ensuring that the most semantically or contextually relevant results are surfaced first. With the rise of neural search and large-scale vector databases, rerankers have become even more important as they bridge the gap between high-recall retrieval and high-precision semantic understanding.

In the context of RAG systems, rerankers take on an even more critical role. A typical RAG pipeline involves retrieving passages or documents relevant to a user query and then feeding those to a language model to generate grounded responses. If the retrieved content is only loosely relevant or noisy, the final generation may contain hallucinations or inaccuracies. Rerankers help solve this problem by reordering the retrieved candidates based on a deeper semantic evaluation, often using powerful language models. This ensures that only the most relevant and contextually appropriate passages are forwarded to the generative stage, improving the accuracy and reliability of the system.

The categories of rerankers are as follows:

  • Cross-encoder rerankers: Cross-encoders represent one of the most precise forms of reranking. In this architecture, the query and each document are concatenated and jointly processed by a transformer-based model such as Bidirectional Encoder Representations from Transformers (BERT) or Text-To-Text Transfer Transformer (T5). This allows full token-to-token interaction and deep contextual alignment between the query and the candidate document. As a result, cross-encoders often achieve state-of-the-art performance in semantic search benchmarks. However, this comes at a high computational cost: each query-document pair must be processed individually, making it impractical for large-scale reranking unless the candidate set is already narrowed down.

    Commercial offerings such as Cohere's Rerank application programming interface (API) exemplify this approach. These services allow developers to submit a query and a list of retrieved documents, returning a rescored and reordered list based on deep semantic matching. Cross-encoder rerankers are ideal when precision is more important than speed or cost, such as in legal search, academic research, or QA systems with relatively small candidate pools.

  • Late interaction rerankers: Late interaction models, such as ColBERT and its variants, strike a balance between efficiency and precision. Unlike cross-encoders, they pre-encode documents into token-level embeddings and only encode the query at runtime. During reranking, each token in the query is compared with every token in the candidate document embeddings using similarity operations such as MaxSim. This allows token-wise matching while avoiding the need for full joint encoding.

    Late interaction models offer significantly better scalability than cross-encoders and are well-suited for large collections. Variants like ColBERTv2 use advanced techniques such as vector quantization and dimensionality reduction to reduce storage costs while maintaining high retrieval accuracy. Although late interaction models are not as precise as cross-encoders, they often outperform traditional bi-encoders and single vector retrieval approaches in both effectiveness and efficiency.

  • Hybrid reranking approaches: Hybrid rerankers integrate multiple signals, typically sparse lexical signals like Best Matching 25 (BM25) and dense semantic signals from vector models. One common method is score fusion, where relevance scores from different retrieval strategies are combined, either linearly or via algorithms such as Reciprocal Rank Fusion (RRF). Another pattern involves combining ranked lists from multiple sources and using a reranker to determine final ordering.

    A two-stage hybrid approach is especially common in enterprise search and RAG systems. An initial candidate pool is retrieved using a fast lexical or vector-based method, and then a more powerful reranker, often a cross-encoder, is applied to reorder the top-N results. This setup combines the recall strength of the first-stage with the precision of the second, enabling both scalability and semantic depth. In some systems, reranking is even used in a third stage after applying business logic or user-specific constraints.

  • Learning-to-rank models: Traditional ML methods such as LambdaMART or RankSVM also function as rerankers. These models combine multiple features, like keyword match score, document popularity, recency, or even neural scores, to learn an optimal ranking function. Though less common in modern NLP-centric systems, these models still play a role in hybrid pipelines, especially in production environments where performance tuning is critical.
  • LLM-based rerankers: A recent development is the use of LLMs for reranking. These can be either fine-tuned models (like T5 or GPT variants trained on relevance tasks) or zero-shot prompting approaches where an LLM is given a query and a list of passages and asked to rank them. This offers high flexibility and interpretability, allowing reranking based on complex or dynamic criteria. However, the cost and latency of LLM-based reranking make them best suited for small-scale or high-value applications.
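Of the fusion techniques mentioned under hybrid reranking, Reciprocal Rank Fusion is simple enough to sketch in a few lines. Each ranked list contributes 1/(k + rank) per document, so documents that appear high in several lists rise to the top; k = 60 is the constant commonly used in the literature:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: every list contributes 1/(k + rank) for each
    document it ranks; documents ranked well by multiple retrievers win."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A sparse (BM25) ranking and a dense-vector ranking disagree;
# RRF rewards consensus: d1 and d3 appear in both lists.
bm25 = ["d3", "d1", "d2"]
dense = ["d1", "d4", "d3"]
print(rrf_fuse([bm25, dense]))  # prints ['d1', 'd3', 'd4', 'd2']
```

Because RRF uses only ranks, not raw scores, it needs no score normalization across heterogeneous retrievers.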

Reranking in RAG pipelines

In RAG pipelines, reranking significantly improves the quality of document retrieval before generation. For example, a vector search might retrieve fifty documents based on cosine similarity, but the top-ranked ones might not always be the most relevant. A reranker, whether a cross-encoder or late interaction model, can reorder these candidates, ensuring that only the most relevant ones are passed into the LLM's context window. This not only improves generation accuracy but also reduces hallucinations by grounding the output in semantically aligned information.

Rerankers thus serve as a semantic filter in RAG, compressing and distilling the document pool into a focused, high-precision context for generation. Many modern RAG implementations, including those in LangChain and LlamaIndex, now include reranking as a built-in or optional module. Vector databases like Qdrant, Weaviate, and Pinecone also support over-fetching and reranking workflows, allowing developers to easily combine fast retrieval with accurate semantic sorting.

Reranking using cross-encoder in multimodal RAG

In multimodal RAG systems, retrieval fidelity is critical to ensure the relevance and alignment of retrieved context with the input query, be it text, image, or a combination of modalities. While initial retrieval is often handled by bi-encoders or dual encoders for computational scalability, the coarse similarity scores produced at this stage may lack fine-grained semantic alignment. This introduces the need for an intermediate reranking stage, which evaluates candidate documents with greater expressiveness and precision.

One of the most effective reranking strategies involves the use of cross-encoders, which are models that jointly encode both the query and each candidate document to compute a more accurate relevance score. In contrast to bi-encoders, where embeddings for queries and documents are computed independently and compared using cosine or dot product similarity, cross-encoders perform full token-level interaction between the two inputs. This design allows for rich cross-attention mechanisms and deeper semantic reasoning, resulting in higher-quality rankings.

Cross-encoder architecture in multimodal settings

In a multimodal RAG context, where either the query or documents (or both) may consist of text and image pairs, a cross-encoder must be capable of fusing visual and textual inputs. This is typically achieved through vision-language models (VLMs) such as CLIP, Bootstrapping Language-Image Pre-training (BLIP), Flamingo, or newer transformer-based architectures like GIT, OFA, or Qwen-VL. These models encode image and text jointly, enabling the model to reason over multimodal inputs.

For reranking, a common pipeline involves:

  • Stage 1-initial retrieval (bi-encoder): A fast dense retriever fetches the top-k documents/images using ANN search on vector embeddings.
  • Stage 2-cross-encoder reranking: For each retrieved candidate, the query and candidate are fed together into a cross-encoder. The model computes a relevance score based on joint attention between query and candidate tokens across modalities.
  • Stage 3-top-N selection: Candidates are ranked using cross-encoder scores, and the top-N (where N < k) are passed to the generator.

Cross-encoders vs. late interaction rerankers

While late interaction models like ColBERT, ColPali, and ColQwen also provide token-level scoring, they maintain independent encoding of query and document tokens, deferring fine-grained comparison to the scoring stage. In contrast, cross-encoders process both sequences simultaneously, enabling global token-to-token interactions via cross-attention layers. This makes cross-encoders more expressive but computationally expensive, as they must encode each query-document pair individually.
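The MaxSim-style late interaction scoring that models like ColBERT defer to the scoring stage can be sketched with NumPy. The token embeddings below are random stand-ins for the output of a real token-level encoder; the point is the scoring operation itself, not the representations:

```python
import numpy as np

def maxsim_score(query_tok_embs: np.ndarray, doc_tok_embs: np.ndarray) -> float:
    """ColBERT-style late interaction: cosine similarity between every
    query token and every document token, then the maximum over document
    tokens for each query token, summed across the query."""
    q = query_tok_embs / np.linalg.norm(query_tok_embs, axis=1, keepdims=True)
    d = doc_tok_embs / np.linalg.norm(doc_tok_embs, axis=1, keepdims=True)
    sim = q @ d.T                        # shape (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 128))       # 4 query-token embeddings
# doc_a shares two token embeddings with the query; doc_b is unrelated.
doc_a = np.vstack([query[:2], rng.normal(size=(6, 128))])
doc_b = rng.normal(size=(8, 128))
print(maxsim_score(query, doc_a) > maxsim_score(query, doc_b))  # prints True
```

Because the document-side embeddings are computed once offline, only the small similarity matrix is evaluated at query time, which is what gives late interaction its efficiency edge over full joint encoding.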

Table 9.1 compares three common architectures used for retrieval and reranking in RAG systems: bi-encoders, late interaction models, and cross-encoders. Each approach offers a different trade-off between scalability and accuracy, based on how query-document pairs are encoded and compared. Bi-encoders prioritize speed and scalability by independently encoding inputs, making them ideal for large-scale first-stage retrieval. Late interaction models introduce token-level comparisons post-encoding, striking a balance between performance and cost. Cross-encoders, though computationally intensive, deliver the highest accuracy by jointly encoding and deeply interacting with both inputs, making them the preferred choice for precision reranking over small candidate sets.

Feature             Bi-encoder              Late interaction             Cross-encoder
Encoding            Independent             Independent                  Joint
Interaction         None                    Token-level (post-encode)    Full (within encoder)
Scalability         High                    Moderate                     Low
Accuracy            Moderate                High                         Highest
Use case in RAG     First-stage retriever   Light-weight reranker        Precision reranker (small N)

Table 9.1: Comparison of architectures for retrieval and reranking in RAG systems

Applications in multimodal retrieval

In multimodal use cases, such as product search, medical imaging, visual question answering (VQA), and interactive assistants, a query might consist of a question paired with an image, or the system might retrieve relevant text from a document corpus using an image as the input. Cross-encoders play a vital role in these setups by ensuring that retrieved documents exhibit semantic and modality-aware alignment with the query. For example, when a user submits an image of a laptop with the query "Which model has HDMI and USB-C?", a cross-encoder can jointly attend to both the image and the product descriptions to rerank the most relevant matches.

Despite their accuracy, cross-encoders are computationally expensive, especially in multimodal scenarios where images require high-dimensional encoding and preprocessing. Several strategies are adopted to mitigate this cost:

  • Use of cross-encoders only on top-k candidates retrieved by bi-encoders.
  • Model distillation, where a lightweight model is trained to approximate cross-encoder scores.
  • Caching relevance scores for frequently asked queries or popular items.
  • Query-aware pruning, where only documents with overlapping metadata are passed to the cross-encoder.
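Of these, the caching strategy is the simplest to sketch: memoize the expensive pair-scoring call so repeated (query, document) pairs never hit the model twice. The scorer below is a toy stand-in for a cross-encoder forward pass; `functools.lru_cache` gives a minimal in-process cache:

```python
from functools import lru_cache

CALLS = 0  # counts how often the "expensive" model actually runs

@lru_cache(maxsize=10_000)
def cross_score(query: str, doc: str) -> float:
    """Toy stand-in for an expensive cross-encoder forward pass."""
    global CALLS
    CALLS += 1
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q), 1)

# The same popular (query, doc) pair is scored once, then served from cache.
cross_score("hdmi usb-c laptop", "laptop with hdmi and usb-c ports")
cross_score("hdmi usb-c laptop", "laptop with hdmi and usb-c ports")
print(CALLS)  # prints 1
```

A production system would use a shared cache (e.g., keyed on hashes of the pair) rather than a per-process one, but the trade-off is the same: memory for saved forward passes.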

In multimodal RAG systems, cross-encoder-based reranking acts as a high-precision filter, refining coarse retrieval outputs before they are passed to the language model for generation. By allowing full interaction between query and candidate tokens, including across image and text inputs, cross-encoders significantly enhance semantic matching. Although computationally heavier than other reranking approaches, their deployment on a small number of candidates makes them feasible and valuable for improving retrieval quality in real-world applications.

Commercial reranker

Several technology providers now offer hosted reranking solutions that can be easily integrated into search or RAG pipelines without the need for developing and maintaining in-house models. Among the most prominent is Cohere's Rerank API, a powerful transformer-based cross-encoder that takes a query along with a list of candidate documents and returns them reordered by semantic relevance, each with an associated confidence score. This model processes the query and each document jointly, enabling deep contextual understanding and precise matching. The latest versions of the service support long documents, multilingual capabilities, and various content types, including code and semi-structured data, while maintaining improved latency and efficiency compared to earlier releases.

Other cloud providers offer similar reranking capabilities. Microsoft’s Azure Cognitive Search includes a semantic reranking feature that enhances the relevance of top-k results using transformer-based models from the Turing series. This semantic reranking can optionally generate highlights and explanations for the ranked results, making it suitable for enterprise search applications.

Amazon provides multiple reranking options through services like Amazon Kendra and Amazon Bedrock. Bedrock users can access hosted rerankers such as Cohere’s API directly within the Amazon Web Services (AWS) ecosystem, enabling high-accuracy semantic reranking on top of existing vector or keyword search outputs.

Open-source ecosystems also support integration with hosted rerankers. For example, OpenSearch and Elasticsearch can be configured to use external APIs as second-stage rerankers. Some open-source tools, such as Answer.AI's reranker library, provide unified Python interfaces to a variety of reranking models, allowing developers to plug in alternatives like cross-encoders or late interaction models with minimal effort. These integrations make it feasible to upgrade standard search pipelines with sophisticated neural reranking models that significantly improve final result quality.

Recap of cross-encoder

如图 9.2第 1 章“新时代生成式人工智能导论”中所述,交叉编码器是一种神经网络模型架构常用于需要细粒度交互的任务。 它采用一对输入,尤其适用于语义相似度、排序和问答等任务。与双编码器不同的是,它能够联合处理查询和候选(例如文档),从而在整个Transformer架构中实现词元级的交叉注意力机制。

A cross-encoder, as explained in Figure 9.2 and Chapter 1, Introducing New Age Generative AI, is a neural model architecture commonly used in tasks requiring fine-grained interaction between a pair of inputs, most notably in semantic similarity, ranking, and QA. It is distinguished from bi-encoders by the fact that it processes both the query and the candidate (e.g., document) jointly, allowing token-level cross-attention throughout the entire transformer stack.

The figure shows a product image, description, and specifications on the left; arrows point to a green box labeled single encoder or cross-encoder, which in turn points to the blue label score on the right.

Figure 9.2: Cross-encoder
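Concretely, the joint input shown in Figure 9.2 is a single concatenated token sequence of the form [CLS] query tokens [SEP] document tokens [SEP]. A minimal sketch of building that sequence (whitespace-split words stand in for a real subword tokenizer):

```python
def build_pair_sequence(query: str, document: str):
    """Concatenate query and document into one cross-encoder input:
    [CLS] q1..qm [SEP] d1..dn [SEP]."""
    q_toks = query.split()      # stand-in for a subword tokenizer
    d_toks = document.split()
    seq = ["[CLS]", *q_toks, "[SEP]", *d_toks, "[SEP]"]
    # Three special tokens are added, so the sequence length is m + n + 3.
    assert len(seq) == len(q_toks) + len(d_toks) + 3
    return seq

print(build_pair_sequence("which model has hdmi", "laptop with hdmi and usb-c"))
```

Because both inputs live in one sequence, every transformer layer can attend across the query/document boundary, which is precisely the full token-level interaction that bi-encoders lack.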

Cross-encoders and their role in embedding

In the context of retrieval systems and RAG architectures, it is essential to distinguish between bi-encoders and cross-encoders, particularly regarding their embedding capabilities and indexing functionality.

A cross-encoder is a model architecture that jointly processes a pair of inputs, typically a query and a candidate [e.g., (query, document) or (query, image) pair]. Unlike bi-encoders, which generate standalone embeddings for queries and documents independently, cross-encoders do not produce reusable, indexable embeddings. Instead, they compute a single relevance score (e.g., a similarity logit) by encoding both inputs together through a shared transformer model. This score quantifies how well the query matches the candidate, but it does not result in a persistent vector representation of either input.

As a result, cross-encoders are not suitable for indexing. They do not generate vector representations that can be stored in vector databases (e.g., Facebook AI Similarity Search (Faiss), Qdrant, ChromaDB) or used for nearest neighbor search. Instead, they are employed in reranking scenarios, where a small set of candidates (retrieved via bi-encoders or keyword search) is rescored for finer semantic accuracy.
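This contrast can be made concrete with toy stand-ins: the bi-encoder returns a standalone vector that can be written to a vector database, while the cross-encoder returns only a scalar tied to one specific pair. Both functions are illustrative placeholders, not real models:

```python
import numpy as np

def bi_encode(text: str) -> np.ndarray:
    """Toy bi-encoder: a standalone vector that could be stored in a
    vector database and compared against any future query."""
    return np.array([len(text), text.count(" "), len(set(text))], dtype=float)

def cross_encode(query: str, doc: str) -> float:
    """Toy cross-encoder: consumes the pair jointly and yields only a
    scalar relevance score; no reusable embedding of either input exists."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q), 1)

doc_vec = bi_encode("laptop with hdmi and usb-c")        # indexable: a vector
pair_score = cross_encode("which laptop", "laptop with hdmi")  # one float per pair
print(doc_vec.shape, isinstance(pair_score, float))      # prints (3,) True
```

The asymmetry in the return types is the whole story: only the vector on the left can be indexed ahead of time, so only bi-encoders can serve first-stage retrieval.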

To understand the practical differences between encoder architectures, it is useful to examine their support for indexable embeddings and how that impacts their role in retrieval workflows. While bi-encoders generate reusable vector representations suitable for large-scale search, cross-encoders operate directly on query-document pairs, enabling high-accuracy semantic reranking without producing standalone embeddings. This fundamental architectural difference is summarized in the following table:

Encoder type      Indexable embeddings    Primary use
Bi-encoder        Yes                     Vector search and retrieval
Cross-encoder     No                      Semantic reranking

Table 9.2: Comparison of bi-encoder and cross-encoder architectures

So, cross-encoders are optimized for scoring, not storage. Their reliance on joint input encoding precludes them from producing detached query or document vectors. Therefore, in RAG systems, they serve a complementary role to bi-encoders by enhancing precision during the reranking stage, but not during the initial retrieval or indexing phases.

In RAG systems, multi-index embedding enables modular and modality-aware information retrieval by maintaining separate vector indexes for different data types such as text, images, or code. Each index is constructed using embeddings generated from modality-specific models, facilitating precise retrieval tailored to the nature of the query. This strategy is particularly effective in multimodal applications, allowing for flexible routing and hybrid retrieval from diverse sources. In contrast, cross-encoders do not generate indexable embeddings. Instead, they process a query and candidate pair jointly and output a single scalar relevance score. This score reflects semantic alignment but cannot be reused or stored for vector-based search. As a result, cross-encoders are exclusively applied in the reranking phase, where a small set of candidates retrieved via multi-index embeddings are re-evaluated for final selection. Together, these approaches offer a robust architecture: multi-index embeddings ensure breadth and modality coverage, while cross-encoders enhance semantic precision at the final step of the pipeline. So cross-encoders do not create multi-index embeddings. Let us understand what a multi-index embedding is.
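A minimal sketch of this division of labor: one toy "index" per modality, a routing rule, and a merge step. The scoring function is an illustrative stand-in for modality-specific embedding models (e.g., CLIP for images, a text-embedding model for text), and the document IDs and routing keyword are invented for the example:

```python
# Toy multi-index retrieval: one index per modality, a router that picks
# indexes, and a merged ranking (cross-encoder reranking would follow).
TEXT_INDEX = {"t1": "laptop with hdmi and usb-c", "t2": "wireless mouse"}
IMAGE_INDEX = {"i1": "photo: laptop ports close-up", "i2": "photo: mouse on desk"}

def score(query: str, doc: str) -> float:
    """Stand-in for a modality-specific embedding similarity."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q), 1)

def route(query: str):
    """Pre-defined routing rule: consult the image index only for
    image-flavored queries."""
    indexes = [TEXT_INDEX]
    if "photo" in query or "image" in query:
        indexes.append(IMAGE_INDEX)
    return indexes

def multi_index_search(query: str, top_n: int = 2):
    hits = []
    for index in route(query):
        hits += [(score(query, doc), doc_id) for doc_id, doc in index.items()]
    hits.sort(reverse=True)  # merged results; a cross-encoder would rerank here
    return [doc_id for _, doc_id in hits[:top_n]]

print(multi_index_search("photo: laptop ports"))  # prints ['i1', 't1']
```

Real systems would replace the routing rule with modality detection or parallel fan-out, and the merge step with score fusion or a cross-encoder rerank, but the structure (per-modality index, route, merge) is the same.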

Multi-index embedding in RAG systems

Multi-index embedding refers to the construction and utilization of multiple vector indexes within RAG architectures. This approach enables systems to retrieve semantically relevant information from heterogeneous data sources, improving the precision, contextual alignment, and multimodal reasoning capabilities of the generative model. To build an understanding, refer to the following list:

  • Definition and rationale: In traditional RAG systems, a single vector index is employed to encode and retrieve data. However, this approach often lacks flexibility when dealing with diverse data modalities (e.g., textual specifications, images, structured tables, or code). Multi-index embedding addresses this limitation by maintaining multiple semantically specialized vector stores, each tailored to a particular modality, domain, or embedding model.
  • Applications and benefits:
    • Modality-specific retrieval: Separate indexes can be constructed for textual documents (using models like text-embedding-3-large), images (e.g., via CLIP), and code (e.g., via CodeBERT), allowing targeted retrieval from each modality.
    • Model optimization: Indexes can leverage different embedding models optimized for their respective content types, thereby enhancing retrieval accuracy.
    • Flexible query routing: During inference, queries may be directed to relevant indexes either independently or in parallel, with results aggregated and optionally reranked.
    • Improved interpretability: The use of multiple indexes allows fine-grained analysis of retrieved sources, aiding in explainability and validation of outputs.

  • System workflow:
    • Index construction: Multiple embedding pipelines are used to construct indexes for distinct data sources or modalities.
    • Query processing: A user query is embedded and routed to one or more indexes based on modality relevance or pre-defined rules.
    • Aggregation and reranking: Retrieved documents from each index are merged and optionally reranked using a cross-encoder or scoring mechanism to improve relevance.
    • Answer generation: The top-ranked documents are passed to the language model for synthesis into a natural language response.

      Multi-index embedding introduces modularity and precision into RAG systems by allowing differentiated treatment of diverse content types. It supports hybrid and multimodal retrieval strategies, which are essential for developing robust AI systems capable of reasoning across varied data landscapes.
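The workflow above can be sketched in miniature. The snippet below is a hedged illustration rather than the chapter's implementation: the toy 2-d vectors, the index names, and the `query_indexes` helper are hypothetical stand-ins for per-modality vector stores such as separate ChromaDB collections.

```python
import math

# Hypothetical in-memory stand-ins for two modality-specific vector indexes.
TEXT_INDEX = {"spec_doc": [1.0, 0.0], "review_doc": [0.6, 0.8]}
IMAGE_INDEX = {"front_photo": [0.0, 1.0], "side_photo": [0.8, 0.6]}

def cosine(u, v):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def query_indexes(query_vec, modalities):
    """Route a query to one or more modality-specific indexes, then
    aggregate the scored hits into one ranked list (a reranker could
    refine this ordering further)."""
    routes = {"text": TEXT_INDEX, "image": IMAGE_INDEX}
    hits = []
    for m in modalities:
        for doc_id, vec in routes[m].items():
            hits.append((m, doc_id, cosine(query_vec, vec)))
    return sorted(hits, key=lambda h: h[2], reverse=True)

results = query_indexes([1.0, 0.1], ["text", "image"])
```

Routing to a single index (`["text"]`) or to both mirrors the "independently or in parallel" query-routing behaviour described above.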

  • Technical architecture: At the core, a cross-encoder is based on transformer models (e.g., BERT, RoBERTa, DeBERTa). Unlike bi-encoders, which compute embeddings for each input independently, a cross-encoder concatenates the inputs and feeds them together into a shared transformer.
  • Input formatting: The input is formatted as a single sequence:

    [CLS] Query tokens [SEP] Document tokens [SEP]

    The transformer processes this sequence, and the output is typically taken from the [CLS] token, which aggregates the contextualized representation of the entire input.
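As a concrete illustration of this formatting, the small helper below assembles the combined sequence at the token-string level; a real tokenizer would emit integer ids, and `build_cross_encoder_input` is a hypothetical name used only for this sketch.

```python
def build_cross_encoder_input(query_tokens, doc_tokens):
    # One joint sequence: [CLS] query [SEP] document [SEP]
    return ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]

# m query tokens + n document tokens + 3 special tokens in total
seq = build_cross_encoder_input(["gaming", "laptop"], ["rtx", "4060", "laptop"])
```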

  • Mathematical explanation: Let the following be:
    • Q = {q₁, q₂, ..., q_m} be the tokenized query
    • D = {d₁, d₂, ..., d_n} be the tokenized document
    • Let T = [CLS], q₁, ..., q_m, [SEP], d₁, ..., d_n, [SEP] be the combined input sequence
    • Let H⁰ ∈ ℝ^{(m+n+3)×d} be the initial embedding matrix of the input tokens (where d is the hidden dimension), derived via word embeddings and positional encodings.

      At each transformer layer l, the representation is updated as:

H^{l} = TransformerLayer(H^{l−1})

Where TransformerLayer applies multi-head self-attention across all tokens in Q and D together, allowing for full cross-interaction.

At the final layer L, a pooling strategy is used:

Often, the final output vector z ∈ ℝᵈ is the representation of the [CLS] token, denoted z = H^L_0

This vector is passed to a scoring head (e.g., a feed-forward layer followed by a sigmoid or softmax) to predict:

  • A similarity score
  • A binary label (relevant / not relevant)

Example: Relevance scoring

The final output score s between a query Q and document D may be computed as:

s = sigmoid(wᵀz + b), where:

  • z ∈ ℝᵈ is the embedding
  • w ∈ ℝᵈ and b ∈ ℝ are learned parameters

In training, this score can be supervised using binary labels (relevant or not), using loss functions such as:

  • Binary cross-entropy (BCE) for pointwise ranking.
  • Hinge loss or pairwise ranking loss for pairwise training.
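Both objectives are simple to state in code. The sketch below is a minimal pure-Python illustration; the function names are ours, not from the chapter's codebase, and a real training loop would compute these losses over batches of tensors.

```python
import math

def bce_loss(score, label):
    # Pointwise: binary cross-entropy on a sigmoid relevance score in (0, 1),
    # supervised with a binary label (1 = relevant, 0 = not relevant).
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(score + eps) + (1 - label) * math.log(1 - score + eps))

def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    # Pairwise: penalize a relevant document that fails to outscore an
    # irrelevant one by at least the margin; zero loss once it does.
    return max(0.0, margin - (pos_score - neg_score))
```

A confident correct score (0.9 for a relevant pair) yields a much smaller BCE loss than a confident wrong one (0.1), while the hinge loss vanishes as soon as the positive beats the negative by the margin.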

Cross-encoders form the second stage in many retrieval systems (like two-stage RAG), where a bi-encoder or vector search retrieves candidate documents and a cross-encoder refines the ranking based on semantic richness. Their computational cost is justified by their high fidelity in relevance modeling.

Code implementation and explanation

The following code implements a modular Multimodal RAG pipeline to retrieve and generate laptop specifications based on image, text, or hybrid inputs. Leveraging CLIP-based image-text embeddings, ChromaDB for vector search, and an Ollama-based LLM for generation, the system offers multiple query modes: image-only, text-only, image + text, and generative answer completion.

This section outlines the full architecture and implementation details of a modular, multimodal assistant system built using a RAG pipeline. It begins with centralized configuration management in config.py and moves through CLIP-based embedding functions, data loaders, and ChromaDB-based index creation for both text and image content. It supports multiple retrieval modes and enables vector fusion for joint queries. To enhance precision, a cross-encoder-based reranker is applied before the final output. The system also integrates Ollama-based text generation and a Streamlit UI offering four interactive modes. Together, these components demonstrate a scalable and extensible RAG implementation for real-world multimodal search and question answering, details as follows:

  • Configuration management: The configuration file config.py centralizes key parameters for maintainability and reuse. This includes ChromaDB directories, model names, and folder paths for images and texts:

    CHROMA_PERSIST_DIR = "chromadb_storage"

    CHROMA_IMAGE_COLLECTION = "laptop_images"

    CHROMA_TEXT_COLLECTION = "laptop_texts"

    IMAGE_FOLDER = "data/images"

    TEXT_FOLDER = "data/documents"

    EMBED_MODEL_NAME = "clip"

    MODEL_NAME = "llama3"

  • Embedding functions using CLIP: embedding_utils.py defines reusable functions to embed text and image inputs using the CLIP model. The CLIP processor and model are initialized once globally to avoid redundant loading:

    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Embedding is handled by:

def embed_text_ollama(text):

    inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)

    ...

    return outputs[0].tolist()

def embed_image_ollama(image_path):

    image = Image.open(image_path).convert("RGB")

    ...

    return outputs[0].tolist()

These functions produce vector representations used for retrieval in the ChromaDB vector store.

  • Data loaders for indexing: The loaders.py module defines utility functions for reading .txt files and loading .jpg/.png image paths:

    def load_text_documents(folder):

    ...

    return docs

    def load_image_paths(folder):

    ...

    return [os.path.join(folder, f) ...]

  • Index building with ChromaDB: The index_builder.py script populates ChromaDB with embeddings and metadata for both text documents and images. It creates or recreates two separate collections:

    text_collection = client.create_collection(name=CHROMA_TEXT_COLLECTION)

    image_collection = client.create_collection(name=CHROMA_IMAGE_COLLECTION)

    Each item is embedded and added using:

    text_collection.add(documents=[content], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": fname}])

    image_collection.add(documents=[""], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": os.path.basename(path)}])

    This step is orchestrated via run_once.py:

    from rag.index_builder import build_index

    if __name__ == "__main__":

    build_index()

Our code is performing multi-index vector creation: two separate indexes, one for text and one for images, within ChromaDB:

  • These two collections are distinct and modality-specific:
    • One is used for text-based embeddings.
    • The other is for image-based embeddings.
    • In app.py, the assistant queries these indexes separately, depending on the user’s selected mode (e.g., text-only query, image-only query, or combined input).
    • The system creates and maintains two separate vector indexes—one for text and one for images—each tailored to a specific modality. Depending on the user’s input (text, image, or both), the system selects and queries the appropriate index (or fuses them), enabling flexible, modality-aware retrieval.

      Note: You can create a single long vector by combining multiple embeddings (e.g., image + text), and our code already does this in the image + text specs mode:

      joint_vec = [(i + j) / 2 for i, j in zip(image_vec, text_vec)]

      This is a simple average fusion of two same-length vectors.

  • Other options for one long vector:
    • If you want to create a single concatenated vector, you can do:

      joint_vec = image_vec + text_vec # results in a longer vector (e.g., 1024 if each is 512)

    • This is called vector concatenation, and is valid if:
      • Your vector database (like ChromaDB) supports higher dimensions.
      • You use the same strategy during both indexing and query time.
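Both fusion strategies can be seen side by side in the short sketch below, using toy 4-d vectors as stand-ins for real 512-d CLIP embeddings:

```python
image_vec = [0.2, 0.4, 0.6, 0.8]   # toy 4-d stand-ins for 512-d CLIP vectors
text_vec = [0.1, 0.3, 0.5, 0.7]

# Average fusion: keeps the original dimensionality, so the fused query
# stays compatible with indexes built from single-modality vectors.
joint_avg = [(i + j) / 2 for i, j in zip(image_vec, text_vec)]

# Concatenation: doubles the dimensionality; every indexed item and every
# query must then use the same concatenated layout.
joint_concat = image_vec + text_vec
```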
  • Cross-encoder reranking: The reranker.py module introduces a CrossEncoder model for reranking retrieved documents:

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

Given a query and candidate metadata, the reranker scores and ranks results based on semantic similarity:

def rerank(query, metadatas):

    pairs = [(query, doc.get("file", "")) for doc in metadatas]

    ...

    return [doc for doc, _ in ranked]

This improves the precision of top results returned by ChromaDB.

  • Language generation with Ollama: The new generation.py module introduces a callable method to invoke an Ollama LLM with a fixed temperature setting:

    def get_llm():

    return Ollama(model=MODEL_NAME, temperature=0.2)

    This is useful for generating human-readable specifications or summaries beyond simple retrieval.

  • Streamlit-based user interface: The app.py module presents a user-friendly frontend using Streamlit. The assistant offers four modes:
    • Image to specs: Embeds image, retrieves similar images, and fetches associated specs.
    • Image + text to specs: Averages text and image vectors, retrieves and reranks.
    • Text-to-image + specs: Pure text query, retrieves the spec document and matches with image.
    • Text to generated answer: Sends the query to the LLM for a generative response:

      if mode == "Text → Generated Answer":

          query = st.text_input("Ask something about laptops")

          if query:

              llm = get_llm()

              response = llm.invoke(query)

              st.text_area("LLM Response", response, height=300)

Each mode interacts with the respective ChromaDB collection and performs reranking to ensure the most relevant response is shown.

This modular, multimodal assistant system exemplifies a real-world implementation of a RAG pipeline. By cleanly separating configuration, embedding, retrieval, reranking, and generation, the system remains highly extensible and easily maintainable. Future enhancements may include document summarization, multilingual support, or a memory mechanism for chat-based interaction.

To do

While the current implementation establishes a robust multimodal retrieval pipeline, it is important to recognize that it does not yet support generative outputs.

In the current state of the project, two key items are intentionally left incomplete to encourage hands-on practice and deeper understanding.

  • generation.py: Your task is to implement the generation module.

    At this point, the rag/ folder does not yet include a fully functional generation.py. Your task is to create this module based on the intended functionality:

    • This module will use the Ollama language model (via LangChain) to generate natural language responses from user queries.
    • You should import the model name from config.py, initialize the LLM using langchain_community.llms.Ollama, and define a get_llm() method.
    • Once completed, this module will enable the Text → Generated Answer mode in app.py.

      Note: This addition will allow your multimodal RAG assistant to not only retrieve specs but also generate fluent explanations or summaries of laptop features.

  • run_once.py: Your task is to move and use this script properly.

    The file run_once.py, which builds your ChromaDB index from all available laptop images and specification documents, should be moved into the scripts/ folder (if not already).

    • This organization improves modularity and keeps utility scripts separate from core logic.
    • Make sure you can still run it correctly using:

      python -m scripts.run_once

      Once run_once.py is in place and generation.py is implemented, your full multimodal RAG system will be complete and production-ready.

Setup instructions

Here are the complete setup instructions to get your multimodal RAG system with generation up and running from scratch:

1. Environment requirements: Ensure that you are using:

a. Python 3.9 or later

b. Pip or conda

c. Internet access (to download models)

2. Directory structure: Set up your folders as shown in the following figure:

The figure shows a file tree for the project multimodal_rag_demo with four folders: data, frontend, rag, and scripts. The data folder holds images and documents; check marks and annotations indicate files to be added or moved.

Figure 9.3: Final folder structure

3. Install dependencies: Create a virtual environment and install required packages:

python -m venv venv

source venv/bin/activate # On Windows: venv\Scripts\activate

pip install --upgrade pip

pip install streamlit torch torchvision transformers sentence-transformers chromadb langchain

4. Download pretrained models (Optional: First time only):

a. Your first run will download:

i. openai/clip-vit-base-patch32 for image/text embedding

ii. cross-encoder/ms-marco-MiniLM-L-6-v2 for reranking

Make sure you have a stable internet connection.

5. Prepare your data: Place your .txt spec documents and .jpg laptop images in:

data/documents/

data/images/

a. Ensure the text and image filenames correspond (e.g., dell_inspiron.jpg and dell_inspiron.txt).

b. Build index (Initial): Run once to create ChromaDB collections:

python run_once.py

This embeds all text and images into Chroma and stores them persistently.

c. Launch the app: Start the Streamlit app:

streamlit run app.py

Access in your browser at: http://localhost:8501/

d. Requirements.txt (Optional):

streamlit

torch

transformers

sentence-transformers

chromadb

langchain

Pillow

e. Then, run the following commands:

pip install -r requirements.txt

Conclusion

In this chapter, you explored the role of reranking in enhancing information retrieval within multimodal RAG systems. By categorizing rerankers and focusing on the powerful cross-encoder approach, you learned how to improve the quality of results retrieved from both textual and visual data. You examined the architecture and logic behind cross-encoders in multimodal contexts and implemented a working reranker to refine image-text retrieval pipelines. To solidify your understanding, a set of practical to-dos challenged you to fill in missing code and structure. In the next chapter, we will explore various retrieval optimization techniques.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 10: Retrieval Optimization for Multimodal GenAI

Introduction

Effective retrieval optimization is critical to building robust and responsive generative AI (GenAI) systems, particularly in multimodal and retrieval-augmented generation (RAG) scenarios. In practical deployments, merely embedding and retrieving data is insufficient; optimizing the retrieval pipeline significantly impacts the accuracy, efficiency, and relevance of generated responses.

In this chapter, we systematically explore key retrieval optimization techniques such as multi-index embedding, modality-based routing, and hybrid retrieval. We not only define each method conceptually but also provide clear, executable Python code examples that illustrate their implementation and practical utility. By applying techniques like query expansion, embedding normalization, and adaptive index refresh, readers will learn to enhance system recall, precision, adaptability, and critical attributes in production-level GenAI systems.

The importance of this chapter lies in its detailed, hands-on approach to improving retrieval effectiveness, a foundational capability for any robust GenAI pipeline. Through optimization, retrieval components can significantly elevate a system's ability to provide contextually accurate, timely, and meaningful responses, thereby directly influencing the user experience and trustworthiness of AI outputs.

Structure

In this chapter, we will learn about the following topics:

  • Retrieval optimization techniques
  • Drawbacks of retrieval systems
  • Retrieval optimization techniques mitigating the limitations
  • Enhancing multimodal RAG with adaptive refresh
  • To do

Objectives

The objective of this chapter is to equip readers with a comprehensive understanding of retrieval optimization techniques essential for building high-performance information retrieval systems. Focusing on strategies like modality-based routing, query expansion, hybrid retrieval, and cross-encoder reranking, the chapter aims to enhance both recall and precision in search tasks. Readers will learn how to implement these techniques through practical code examples, enabling them to build retrieval pipelines that are accurate, adaptive, and efficient. These skills are crucial for improving the foundational retrieval layer of modern AI systems, particularly in multimodal and RAG workflows.

Retrieval optimization techniques

We have already implemented reranking using cross-encoders and multi-index embedding. In this chapter, we now turn our attention to exploring additional retrieval optimization techniques that further improve relevance, efficiency, and multimodal adaptability.

At query time, the user query is encoded into a vector and compared against the stored document embeddings using approximate nearest neighbor (ANN) search, retrieving the top-k most similar candidates. These vector search results are then forwarded to a cross-encoder reranker, which jointly processes the original query and each candidate document to compute fine-grained similarity scores via full token-level interaction. The reranker reorders the results based on semantic relevance, producing a more accurate set of top-k reranked documents.

These reranked documents, along with the original user query, are passed into the large language model (LLM) for synthesis. The LLM generates the final answer, which is returned to the user. This two-stage design balances scalability (via bi-encoder retrieval) with precision (via cross-encoder reranking), resulting in both efficient and high-quality response generation.
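The two-stage pipeline can be sketched end to end with stand-ins for each stage. In the sketch below, the "ANN search" is a brute-force dot product and the "cross-encoder" is simple token overlap; both are hypothetical simplifications of the real components, and all function names are ours.

```python
def ann_retrieve(query_vec, index, k=3):
    # Stage 1 stand-in: rank every document vector by dot product and keep
    # the top-k candidates (a real system would use an ANN index instead).
    scored = sorted(index.items(),
                    key=lambda kv: -sum(q * d for q, d in zip(query_vec, kv[1])))
    return [doc_id for doc_id, _ in scored[:k]]

def cross_encoder_score(query, doc_text):
    # Stage 2 stand-in: a real cross-encoder runs a transformer over the
    # joint (query, document) pair; token overlap mimics its role here.
    q, d = set(query.split()), set(doc_text.split())
    return len(q & d) / max(len(q), 1)

def two_stage_retrieve(query, query_vec, index, texts, k=3):
    # Cheap recall-oriented retrieval, then precise reranking of the few survivors.
    candidates = ann_retrieve(query_vec, index, k)
    return sorted(candidates,
                  key=lambda c: cross_encoder_score(query, texts[c]),
                  reverse=True)
```

The expensive scorer only ever sees k candidates, which is exactly the scalability-versus-precision balance described above.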

Drawbacks of retrieval systems

Retrieval systems in multimodal RAG face several critical drawbacks that limit their effectiveness in real-world applications. First, traditional retrieval pipelines often treat modalities independently, leading to suboptimal fusion of textual and visual information. They also rely heavily on static embeddings, which can fail to capture evolving user intent or contextual nuances. Cross-modal relevance scoring is another challenge, often resulting in irrelevant or mismatched outputs. Furthermore, latency increases significantly when dealing with large-scale, multimodal datasets.

Refer to the following list to understand the limitations that hinder both the accuracy and efficiency of multimodal RAG systems, necessitating more adaptive, intelligent, and unified retrieval mechanisms for future advancements:

  • Poor recall vs. precision trade-offs: A fundamental limitation in information retrieval is the inverse relationship between recall and precision. Systems tuned for high recall often retrieve many items (ensuring fewer relevant documents are missed) at the cost of including more irrelevant results (lower precision), whereas tuning for precision returns only highly relevant hits but risks missing some answers. For example, a semantic embedding search might catch conceptually related documents (improving recall) but also pull in tangential content, whereas a strict keyword search might return only exact matches (high precision) while overlooking paraphrased answers. Balancing this trade-off is challenging, and no single setting optimizes both metrics simultaneously for all queries. As a result, retrieval systems must compromise between completeness and accuracy of results.
  • Limited semantic understanding: Traditional text search engines rely on lexical matching and lack deep semantic comprehension of queries and documents. They treat queries as bags of words, so if a user’s phrasing does not exactly match the wording, relevant documents can be missed. This leads to poor recall in cases where synonyms or contextually related terms are used (e.g., a query about financial earnings might not retrieve a document mentioning quarterly revenue without semantic modelling). Even dense vector retrieval models, while better at semantic matching, have limits; they can capture general meaning but may still fail on nuanced context or rare, domain-specific keywords (e.g., a generic embedding model might miss an exact error code that a lexical search would catch). In multimodal contexts, the semantic gap is even wider: understanding the meaning of an image or aligning it with a text query requires robust cross-modal semantics, which many retrieval systems struggle with. Overall, limited semantic understanding means the system does not truly grasp intent or context, resulting in omissions or irrelevant hits when exact cues are absent.
  • Modality mismatch in multimodal retrieval: When dealing with multiple data types (e.g., images and text together), retrieval systems face the challenge of comparing and combining different modalities. An image query and text documents live in very different feature spaces, and measuring similarity between them is non-trivial. Naively projecting them into a single index can lead to mismatches: the system might not align visual concepts with textual descriptions accurately. This cross-modal alignment problem is a known drawback; multimodal systems require either separate embedding spaces for each modality or a joint space learned to compare them. Without proper alignment, an image + text retrieval system might return items of the wrong modality (e.g., a text snippet when an image was expected) or fail to retrieve relevant cross-modal results. In essence, differences in how modalities represent information (pixels vs. words) can cause retrieval failures if not carefully handled, limiting the system’s effectiveness when queries and targets span images, text, audio, etc.
  • Index staleness (outdated index): Retrieval systems depend on an index of the content (documents, embeddings, etc.), which can become outdated if not refreshed. A static index does not automatically incorporate newly added documents or updates to existing data, so over time it stales, meaning the search results might omit recent information or still reflect removed/changed content. Index staleness is a significant drawback, especially for dynamic corpora: the system’s knowledge freezes at the last indexing point. For example, a news search engine that is not frequently reindexed will fail to surface yesterday’s articles or reflect corrections to past articles. Similarly, in a RAG setting, if the vector store is not updated, the language model may retrieve outdated facts. This issue is compounded if the embeddings themselves drift (e.g., if an updated embedding model is used for new data, old embeddings become incompatible). In short, an out-of-date index can degrade both recall (missing new relevant items) and precision (returning content that is no longer relevant or accurate).
  • Ranking inefficiencies: Even after retrieving a set of candidate documents, ordering them by relevance is not always done optimally due to efficiency constraints. Many retrieval systems use fast but approximate ranking techniques in the first stage (e.g., simple vector similarity or Best Matching 25 (BM25) scores), which may not perfectly correlate with true relevance. Truly optimal ranking might require more complex analyses (like deep neural scoring or cross-attention between query and document), but applying those to every candidate is computationally expensive. There is thus an inherent inefficiency: the most precise ranking models (e.g., cross-encoders) are too slow for large collections, while faster methods may place some relevant items lower than they deserve. For instance, a first-stage dense retriever might retrieve the right document in the top 100 but not rank it first because it cannot fully understand the query’s context or the document’s nuances. As a result, relevant results can be buried, and irrelevant ones may appear high, unless additional reranking steps are taken, introducing latency or complexity. This highlights a gap between efficient retrieval and effective ranking, a notable shortcoming of many retrieval pipelines.
  • Lack of contextual awareness: Retrieval algorithms traditionally treat each query in isolation and each document as an independent chunk of text, which can lead to context-insensitive results. In textual search, if a query is ambiguous or too short, the system has no memory of the user’s intent beyond that query, often returning contextually off-target results. In RAG systems and document question answering (QA), breaking documents into chunks can exacerbate this issue: small passages lose the broader context of the source. Therefore, a retrieved chunk might be factually relevant but uninterpretable or misleading when taken out of context. Anthropic highlights this context conundrum in traditional RAG. For example, a snippet stating “the company’s revenue grew by 3% over the previous quarter” is hard to use if it is unclear which company or timeframe is referenced. The retriever, lacking awareness of the surrounding context, might fetch such fragments that answer a question on the surface but fail to provide clarity. Similarly, a multimodal system might retrieve an image without understanding the narrative context that a user implied. The lack of contextual awareness means retrieval results can be technically relevant to keywords but practically unhelpful or even misleading when context is not preserved.

The following table outlines common drawbacks encountered in information retrieval systems, mapping each limitation to its corresponding impact on retrieval performance. Understanding these challenges highlights the trade-offs and complexities involved in optimizing recall, precision, semantic comprehension, multimodal alignment, index freshness, ranking effectiveness, and contextual awareness in modern retrieval architectures.

Drawback: Poor recall vs. precision trade-offs
Impact: Retrieval systems must compromise between completeness (recall) and accuracy (precision), making it challenging to optimize both simultaneously. This may cause the retrieval of irrelevant results or missing relevant documents.

Drawback: Limited semantic understanding
Impact: Systems fail to grasp deep meaning or intent, leading to missed relevant documents when phrases differ or context is nuanced, causing omissions or irrelevant hits, especially in semantic or multimodal retrievals.

Drawback: Modality mismatch in multimodal retrieval
Impact: Incorrect alignment between different data types (e.g., images and text) results in retrieval failures, such as returning wrong modality items or missing relevant cross-modal results, reducing system effectiveness.

Drawback: Index staleness (outdated index)
Impact: Outdated indexes omit recent information, include obsolete content, and degrade both recall and precision, making retrieval results less accurate and less timely.

Drawback: Ranking inefficiencies
Impact: Fast but approximate ranking may bury relevant documents and elevate irrelevant ones, reducing the effectiveness of result ordering unless slower, complex reranking is applied at added cost and latency.

Drawback: Lack of contextual awareness
Impact: Results can be contextually misleading or unhelpful because retrieval treats queries and documents in isolation, losing broader user intent and narrative context, especially in fragmented or multimodal data.

Table 10.1: Mapping retrieval system limitations to their practical impacts on search quality and user experience

Retrieval optimization techniques mitigating the limitations

To mitigate the preceding limitations, modern retrieval systems employ a range of optimization techniques. The following methods enhance recall, precision, and relevance across textual, multimodal, and RAG scenarios by addressing specific drawbacks.

Multi-index embedding

Instead of representing each document with a single vector or in a single index, multi-index embedding or multi-vector representations techniques use multiple embeddings per item or multiple indexes specialized by content. One common approach is multi-vector indexing, where a long document is segmented into multiple parts, each indexed by its own embedding. This ensures that different topical aspects of a document are captured, improving the chances that at least one segment will match a relevant query. The result is higher recall and finer semantic matching for complex or lengthy documents; the system no longer misses information just because it was buried in a long text. Moreover, considering multiple vectors per document can improve precision by giving a more nuanced representation; each vector covers a specific context, so irrelevant parts of a document are less likely to cause false matches. In practice, multi-index embedding improves semantic coverage and context retention by capturing different facets of content, and it enhances retrieval accuracy and understanding. For example, a technical paper might have separate embeddings for its abstract, methods, and conclusion. A question about the paper’s method will directly hit the method embedding segment, rather than relying on a single vector that might dilute this detail. In RAG pipelines, multi-vector schemes similarly allow long knowledge articles to be queried effectively without losing pertinent details, thereby addressing poor recall on long documents and mitigating the loss of context within those documents.
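The segmentation idea above can be sketched in a few lines. This is a minimal illustration, not the book’s Qdrant pipeline: `embed` is a placeholder for a real embedding model (e.g., a Sentence Transformer), and the index is a plain list rather than a vector database. Every section of a document gets its own vector, each pointing back to the same parent document:

```python
from typing import Callable, Dict, List, Tuple

def build_multi_vector_index(
    doc_id: str,
    sections: Dict[str, str],              # e.g., {"abstract": "...", "methods": "..."}
    embed: Callable[[str], List[float]],   # stand-in for a real embedding model
) -> List[Tuple[str, str, List[float]]]:
    """Return one (doc_id, section_name, vector) entry per section."""
    return [(doc_id, name, embed(text)) for name, text in sections.items()]

def search_multi_vector(
    index: List[Tuple[str, str, List[float]]],
    query_vec: List[float],
) -> Tuple[str, str, float]:
    """Find the best-matching section; the parent doc_id rides along with it."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(
        ((doc, sec, dot(vec, query_vec)) for doc, sec, vec in index),
        key=lambda entry: entry[2],
    )
```

A query about a paper’s method then lands on the `methods` vector directly, rather than on one diluted whole-document vector.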

Modality-based routing for multimodal queries

To tackle modality mismatch, retrieval architectures introduce modality-based routing, which means that queries are directed to modality-specific indexes or models. Rather than forcing all data types into one homogeneous representation, the system maintains separate pipelines optimized for text, images, audio, etc., and then combines the outputs. For example, a multimodal search engine might have one vector index for text passages and another for image embeddings; if a query contains both an image and text, it routes each part to the appropriate index. This way, each modality is handled with the best-suited retrieval method, e.g., Contrastive Language–Image Pretraining (CLIP) embeddings for images, Bidirectional Encoder Representations from Transformers (BERT) based embeddings for text, without one modality’s noise confusing the other. By isolating modalities, the system avoids direct comparisons of incomparable features and thus reduces cross-modal error. In practice, one can query multiple indexes in parallel and then perform a late fusion of results, ensuring that the top results from each modality are considered. If a user asks a question with an image example attached, the image can be used to fetch similar images while the text query retrieves relevant documents; the results can then be merged. Another strategy is joint embedding spaces (a form of routing at the model level): using models like CLIP, which learn a shared vector space for text and images so that an image query and a caption can be directly compared. This aligns modalities to speak a common language of vectors, greatly alleviating the modality mismatch problem. Modality-based routing (whether via separate indices or joint embeddings) ensures that each data type’s unique characteristics are respected, thereby improving precision and recall in multimodal retrieval by addressing the cross-modal alignment challenge.
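The late-fusion step described above can be sketched as follows. This assumes each modality’s scores have already been normalized to a comparable scale; the merge simply keeps the best score per item across modalities before producing one ranked list:

```python
from typing import List, Tuple

def late_fusion(
    text_hits: List[Tuple[str, float]],    # (item_id, normalized score) from text index
    image_hits: List[Tuple[str, float]],   # (item_id, normalized score) from image index
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Merge per-modality result lists, keeping the best score per item."""
    best = {}
    for hits in (text_hits, image_hits):
        for item_id, score in hits:
            best[item_id] = max(score, best.get(item_id, float("-inf")))
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)[:k]
```

Items retrieved by both modalities keep their stronger score, so cross-modal agreement is never penalized.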

Query expansion

Query expansion is a classic technique to improve recall and bridge lexical-semantic gaps in textual retrieval. The idea is to expand the user’s query with additional terms or phrases that have similar meaning, including synonyms, related concepts, or alternate formulations. By automatically broadening the query, the system retrieves documents it might otherwise miss due to wording differences. For instance, a query on global warming effects could be expanded with terms like climate change impacts so that documents using either term are considered. This directly addresses the poor recall aspect of the recall-precision trade-off: expansion increases the number of relevant results found (at some cost to precision). In practice, modern systems use thesauri, language models, or even LLMs to generate expansions. In the context of RAG, query expansion can feed the retriever multiple reformulations of a question, yielding a richer set of context passages for the generator. While this may introduce a few more irrelevant hits (since the query is broader), it significantly reduces the chance of missing pertinent information hidden behind different terminology. Smart expansion strategies (e.g., only adding highly relevant synonyms, or using feedback from initial results to expand further) help maintain precision while boosting recall. By covering more semantic ground, query expansion mitigates the limited semantic understanding of strict keyword search and improves the system’s ability to find relevant data despite vocabulary mismatches.
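As a minimal sketch, a hand-built synonym table can stand in for the thesaurus, language model, or LLM that a production system would use to generate expansions:

```python
# Illustrative thesaurus; production systems would use WordNet, embeddings, or an LLM.
SYNONYMS = {
    "global warming": ["climate change"],
    "effects": ["impacts"],
}

def expand_query(query: str) -> list:
    """Return the original query plus variants with synonyms substituted."""
    variants = [query]
    lowered = query.lower()
    for term, alternatives in SYNONYMS.items():
        if term in lowered:
            for alt in alternatives:
                variants.append(lowered.replace(term, alt))
    return variants
```

Each variant is then issued to the retriever, and the union of results covers documents that use either phrasing.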

Embedding normalization

Embedding normalization is a low-level but crucial optimization in vector-based retrieval. It addresses an often-overlooked issue: vector embeddings can vary in length (magnitude), which can skew similarity computations. For example, if one document’s embedding has a larger norm than another’s, it might score higher on a dot product similarity with a query even if the direction (semantic content) is less aligned. Normalization (typically L2 normalization to unit length) ensures that all vectors lie on the same hypersphere, so that similarity is determined purely by angle (cosine similarity) rather than vector length. This improves the semantic fidelity of retrieval—documents are retrieved for being truly similar in content, not just because their embedding has a larger magnitude. Normalized embeddings also bring numerical stability and consistency: maximizing inner product becomes equivalent to maximizing cosine similarity, making the retrieval metric well-behaved and comparable across queries. In practice, many embedding models already output normalized vectors or have an option to do so; if not, vector databases often allow flagging that data should be treated as normalized. By preventing any single vector from dominating due to length anomalies, normalization yields a more reliable ranking of results (addressing a subtle ranking inefficiency). It is especially important in multimodal settings or when merging results from different models, as their embedding scales might differ. Ensuring a uniform scale removes one source of error, letting the retrieval focus on true semantic similarity. In summary, embedding normalization fine-tunes the retrieval engine’s mathematical underpinning to enhance precision and consistency in results.
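A minimal sketch of L2 normalization with NumPy; once vectors are unit length, the plain dot product of two vectors equals their cosine similarity:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

For instance, [3, 4] and [6, 8] point in the same direction; after normalization their dot product is exactly 1.0 regardless of the original magnitudes.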

Hybrid retrieval

Hybrid retrieval combines the strengths of keyword (lexical) search and vector (semantic) search to overcome each method’s weaknesses. Rather than relying on one approach, a hybrid system performs both a lexical match (e.g., BM25 or TF-IDF index) and a semantic similarity search (via embeddings) and then merges the results. This technique directly confronts the limited semantic understanding of pure keyword search and the complementary issue that pure semantic search can miss exact or rare terms. By using both, the system can balance precise term matching with broader semantic coverage. For example, consider a technical query containing a specific error code and a general problem description: the BM25 component will ensure documents containing that exact error code are retrieved, while the embedding component will fetch documents about the general problem even if they phrase it differently. Rank fusion or reranking of the combined candidate list then yields a final ranking that is more comprehensive and relevant than either method alone. Modern RAG pipelines frequently use this approach; first, gather a set of top-N passages by lexical search and top-M by vector search, then deduplicate and rerank them together. The result is significantly improved recall and precision, as evidenced by Anthropic’s example, where using both methods returns more applicable chunks for generation. Hybrid retrieval also mitigates context loss: lexical matches can provide the exact contextual identifiers (like names or numbers) that an embedding might overlook, anchoring the semantic results in concrete details. Overall, this optimization addresses recall/precision trade-offs by effectively combining two scoring signals, yielding a retrieval that is both accurate and semantically aware.

Score normalization

Hybrid retrieval systems merge the outputs of keyword-based (lexical) search and semantic (vector) search, but a major technical challenge lies in combining their fundamentally different scoring schemes into a unified ranking. Lexical models like BM25 produce scores based on term frequency and document statistics, while vector search provides similarity measures, often cosine or Euclidean distance, which are not directly comparable or even on the same numeric scale.

To address this, score normalization techniques are applied before merging results. The normalization process transforms scores from each method into a common scale (often [0, 1] via min-max scaling, or standardized z-scores), allowing fair combination and fusion. Typical strategies include:

  • Min-max normalization: Each set of scores (BM25 and embedding) is scaled so its minimum becomes 0 and maximum becomes 1, preserving ranking within each group while enabling direct comparison.
  • Z-score normalization: Scores are standardized based on their distribution, centering them and allowing outliers to be managed appropriately.
  • Rank-based fusion: Instead of merging raw scores, items are ordered within each method, and fusion is done by interleaving or summing their ranks.
  • Learned or weighted fusion: Training a model or using heuristics to optimally weigh or combine the normalized scores based on relevance feedback.

For example, in a typical hybrid pipeline, top-N results from BM25 and top-M from embedding search are first selected. Their scores are then normalized, duplicate hits are merged (often keeping the best score per method), and the final list is reranked using the fused (combined or weighted) scores. This process ensures that precise keyword matches (e.g., for IDs or rare terms) are not overshadowed by semantically similar but less precise content, and vice versa.
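The pipeline just described can be sketched as follows, using min-max normalization followed by a weighted sum (the weight `w_lex` is an illustrative knob, not a prescribed value):

```python
def min_max(scores: dict) -> dict:
    """Scale a {doc_id: score} map to [0, 1]; constant score sets map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}

def fuse(bm25_scores: dict, vec_scores: dict, w_lex: float = 0.5) -> list:
    """Weighted sum of min-max-normalized lexical and vector scores."""
    lex, vec = min_max(bm25_scores), min_max(vec_scores)
    ids = set(lex) | set(vec)
    fused = {d: w_lex * lex.get(d, 0.0) + (1 - w_lex) * vec.get(d, 0.0) for d in ids}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
```

Because both score sets are brought onto the same [0, 1] scale before the weighted sum, neither the raw BM25 magnitudes nor the cosine similarities can dominate the fused ranking.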

Score normalization is essential for hybrid retrieval to avoid one modality dominating due to numerical scale differences, ultimately enabling the system to leverage the strengths of both lexical precision and semantic breadth for the best possible retrieval performance.

Reranking with cross-encoders

We have already built an understanding of reranking with a cross-encoder. However, let us understand it more thoroughly. To tackle the ranking inefficiency of first-stage retrieval, systems often employ a reranking step with cross-encoders (or other powerful rerankers). A cross-encoder is a transformer model that takes a query and a candidate document together as input and produces a relevance score, effectively performing a deep semantic comparison with full context. This is far more accurate than the independent encoding used in bi-encoder models (where query and document are embedded separately). The drawback, of course, is that doing this for every possible document is infeasible; however, doing it for a small set of top candidates (say, top 50 or 100 from the initial retriever) is usually manageable. The strategy, therefore, is to use a fast retriever to get a candidate pool, then apply a cross-encoder to rerank those candidates with high precision. This two-stage approach addresses the earlier trade-off by combining speed and accuracy. The cross-encoder corrects the mistakes of the first stage. For instance, it can detect that a top-ranked passage only superficially matches the query and is not truly relevant, demoting it below a genuinely relevant passage that the initial stage had ranked lower. Empirically, adding a cross-encoder reranker significantly boosts metrics like Mean Reciprocal Rank (MRR) or precision@k, as it filters out false positives and reorders results based on a richer understanding of query context and document content. In a RAG system, improved reranking means the LLM gets more relevant grounding passages, directly improving answer quality. The cost is extra computation, but optimizations exist (e.g., using smaller cross-encoders or only reranking a subset). 
Overall, cross-encoder reranking is a targeted fix for ranking inefficiency as it injects a high-context, high-precision judgment just where it is needed, at the final ranking of top candidates to ensure the results are as relevant and contextually appropriate as possible.
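The two-stage pattern can be sketched with a pluggable scoring function. In a real pipeline, `score_fn` would wrap a genuine cross-encoder (for example, the `predict` method of `sentence_transformers.CrossEncoder`); here a toy word-overlap scorer keeps the sketch self-contained and runnable:

```python
def rerank(query: str, candidates: list, score_fn, top_k: int = 5) -> list:
    """Second stage: score each (query, document) pair jointly, then reorder."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

def word_overlap(query: str, doc: str) -> int:
    """Toy stand-in for a cross-encoder: count shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))
```

Only the small candidate pool from the first stage is ever passed through `score_fn`, which is what keeps the expensive joint scoring affordable.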

Prefiltering thresholds

Prefiltering thresholds are a practical optimization used in multi-stage retrieval systems to reduce the computational latency of reranking with cross-encoders. Since evaluating a cross-encoder on every candidate document is prohibitively slow, prefiltering thresholds help ensure that only the most promising candidates (i.e., those likely to be relevant) are passed on for costly reranking. Here is how this works and why it is effective:

  • Prefiltering thresholds are criteria applied to the initial candidate pool (produced by a fast retriever, such as BM25 or a bi-encoder) to exclude low-scoring or obviously irrelevant candidates before reranking.
  • The threshold can be a score cutoff (only rerank candidates with a score above X) or a top-N cutoff (rerank only the top 50 or 100 candidates by initial score), or even a hybrid (e.g., all documents above a high score plus up to N total).

The importance of prefiltering thresholds is as follows:

  • Reduces reranking latency: Cross-encoders are computationally expensive because they process the query and document together in a deep model. Prefiltering shrinks the candidate set, so fewer items need to be reranked, making the second stage much faster.
  • Maintains high-quality results: With a carefully set threshold, most truly relevant documents still make it into the reranking stage, so final accuracy does not suffer, and you avoid wasting compute on likely-irrelevant documents.
  • Balances precision and efficiency: By tuning the threshold (e.g., increasing N or lowering the score cutoff), systems can find the best trade-off between result quality and response time, adjusting for available compute or latency budgets.

The following is an example of implementation:

  • Suppose your first-stage retriever (BM25 or vector search) retrieves 1,000 candidates. You set a prefiltering threshold of top-64 by score (N=64).
  • Only these 64 are sent to the cross-encoder reranker. The cross-encoder produces a highly accurate relevance order, and only the top-k (say, 5 or 10) are returned to the user or LLM.
  • Optionally, you can use both a score and rank threshold to further tighten the pool, e.g., BM25 score > 5.0 AND top 100 by rank.

The following are the benefits:

  • Significant speedup: Dramatically reduces the number of slow, expensive reranking computations per query.
  • Flexibility: Thresholds can be tuned for specific use cases to optimize latency, quality, or cost depending on requirements.
  • Quality control: Ensures only reasonable candidates are considered for the final results, decreasing the chance of low-quality answers.

Prefiltering thresholds act as a smart filter between fast retrieval and slow, accurate reranking by cross-encoders, ensuring only the most promising documents are reranked. This approach enables you to enjoy the high precision of a cross-encoder—but without incurring prohibitive inference cost or latency for every candidate—by reducing the reranking workload to a manageable, high-likelihood subset.
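The hybrid cutoff described above (a score floor combined with a top-N cap) can be sketched as:

```python
def prefilter(candidates: list, score_cutoff: float = None, top_n: int = None) -> list:
    """Prune a [(doc_id, first_stage_score)] list before costly reranking.

    Applies an optional score floor (keep only scores above the cutoff)
    and an optional top-N cap, in that order."""
    kept = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if score_cutoff is not None:
        kept = [(doc, score) for doc, score in kept if score > score_cutoff]
    if top_n is not None:
        kept = kept[:top_n]
    return kept
```

With, say, `score_cutoff=5.0` and `top_n=64`, only candidates that clear both bars ever reach the cross-encoder, which is where the latency savings come from.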

Adaptive index refresh

Finally, to combat index staleness and embedding drift, retrieval systems implement adaptive index refresh policies as shown in Figure 10.1. This means the index is not a once-and-done static structure but is updated on a schedule or in response to changes. One aspect is incremental indexing: as new documents arrive or existing ones change, they are added to (or reindexed in) the search index continuously or periodically, rather than waiting for a complete reindexing. This keeps the content fresh and ensures recall of up-to-date information. In practice, production systems have an index update pipeline that feeds new data and uses background processing to keep the vector store current. Another aspect is adapting to changes in the embedding model itself. If the system’s vector encoder is retrained or replaced (for example, a newer language model is deployed), the stored embeddings may no longer be compatible or optimal. Adaptive refresh entails re-embedding the corpus when models are updated or when significant drift is detected. Monitoring can be used to decide when re-embedding is necessary (e.g., if similarity scores start degrading or recall@k drops). By re-computing embeddings on the latest model and swapping them into the index, the system maintains alignment between query embeddings and document embeddings, preventing the relevance mismatches that arise from embedding drift. In sum, adaptive index refresh addresses the staleness drawback by ensuring the retrieval index remains a living reflection of both the data and the model’s understanding. This results in more accurate and timely retrieval: new knowledge is searchable, and the similarity comparisons remain valid over time. Techniques in this vein include scheduled reindexing, real-time indexing for streaming data, and hybrid approaches where recent data is searched live (fallback to slow search) if not yet indexed. 
Together, these practices guarantee that the retrieval system’s knowledge stays current and its vector space stays consistent, thus upholding retrieval performance in the face of evolving content and models.
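A toy sketch of the refresh policy follows: each stored vector remembers which model version produced it, so an upsert handles incremental indexing and a model upgrade triggers selective re-embedding. The class and method names are illustrative, not a Qdrant API:

```python
class AdaptiveIndex:
    """Toy index illustrating incremental upserts plus model-driven re-embedding."""

    def __init__(self, embed, model_version):
        self.embed = embed                  # current embedding function
        self.model_version = model_version  # version tag stored with every vector
        self.entries = {}                   # doc_id -> {"version", "text", "vector"}

    def upsert(self, doc_id, text):
        # Incremental indexing: new or changed docs are embedded immediately.
        self.entries[doc_id] = {
            "version": self.model_version,
            "text": text,
            "vector": self.embed(text),
        }

    def refresh(self, new_embed, new_version):
        # Adaptive refresh: re-embed anything stored under an older model version.
        self.embed, self.model_version = new_embed, new_version
        stale = [d for d, e in self.entries.items() if e["version"] != new_version]
        for doc_id in stale:
            self.upsert(doc_id, self.entries[doc_id]["text"])
        return len(stale)
```

In production the same logic would drive a background reindexing job against the vector store, with `refresh` triggered by a model deployment or by monitoring signals such as degrading similarity scores.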

So, modern retrieval systems, whether pure document search, multimodal search, or RAG, are far from static keyword-matchers. They are complex, evolving systems that must overcome fundamental limitations in recall, precision, semantic understanding, cross-modal alignment, and context handling. By applying the above optimization techniques, such systems markedly improve in robustness and relevance: multi-vector representations enrich what is indexed, modality-specific handling aligns disparate data types, query expansion broadens the search horizon, embedding normalization and hybrid search refine the matching process, rerankers inject intelligent ordering, and continuous index refresh keeps the system up-to-date. Each technique targets specific drawbacks, and together they enable retrieval pipelines to provide high-quality, context-aware results in increasingly diverse and demanding applications. The interplay of these methods exemplifies how conceptual innovation (rather than just mathematical complexity) can drive substantial improvements in information retrieval performance and reliability.

Retrieval optimization techniques

In high-performance retrieval systems, especially those supporting multimodal inputs and RAG, optimizing the retrieval process is essential for achieving high precision, recall, and contextual relevance. This section details the implementation of core retrieval optimization strategies using Python and Qdrant, with embeddings generated via Sentence Transformers. Each technique is motivated by a real-world challenge and substantiated with modular, reusable code; details as follows:

  • Modality-based routing:
    • Challenge: In multimodal systems, different data types like text, images, and audio require specialized processing. A unified vector space often fails to capture the distinct semantics of each modality.
    • Solution: Modality-based routing directs the user query to the appropriate index (textual, visual, etc.) based on the detected input modality or intent:

      def route_query(query: str, modality: str = "text") -> str:
          routing_table = {
              "text": "text_index",
              "image": "image_index",
              "multimodal": "hybrid_index"
          }
          return routing_table.get(modality, "text_index")

      This function ensures that each query is processed by the most relevant sub-index, avoiding modality mismatch and improving retrieval precision.
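
In practice, the modality argument often has to be inferred rather than supplied by the caller. A minimal sketch, assuming a naive file-extension heuristic; the detect_modality helper and its rule are hypothetical additions, not part of the routing table above:

```python
def detect_modality(query):
    """Naive modality detection (illustrative): image file paths route to
    the image index, everything else is treated as text."""
    if isinstance(query, str) and query.lower().endswith((".jpg", ".jpeg", ".png")):
        return "image"
    return "text"

def route_query(query, modality=None):
    # Fall back to detection when the caller does not state the modality.
    routing_table = {"text": "text_index", "image": "image_index", "multimodal": "hybrid_index"}
    modality = modality or detect_modality(query)
    return routing_table.get(modality, "text_index")
```

A richer detector might inspect MIME types or run an intent classifier; the fallback-to-text default mirrors the behavior of the routing table itself.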

  • Query expansion:
    • Challenge: Lexical mismatches (e.g., car vs. automobile) limit recall in both sparse and dense retrieval systems.
    • Solution: Query expansion increases semantic coverage by adding related terms to the original query:

      def query_expansion(query: str) -> list:
          synonym_dict = {
              "climate": ["environment", "weather"],
              "car": ["vehicle", "automobile"]
          }
          words = query.split()
          expanded = set(words)
          for word in words:
              if word in synonym_dict:
                  expanded.update(synonym_dict[word])
          return list(expanded)

    By expanding climate to include environment and weather, the retrieval system is more likely to return conceptually relevant documents that use alternate terminology.
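
To see the recall effect end to end, the toy run below performs a bag-of-words match with and without the expanded terms. The documents and the matches helper are made up for illustration:

```python
def expand(words, synonym_dict):
    # Add the synonyms of every query term to the term set.
    expanded = set(words)
    for w in words:
        expanded.update(synonym_dict.get(w, []))
    return expanded

def matches(query_terms, docs):
    # A document matches if it shares at least one term with the query.
    return [d for d in docs if set(d.split()) & set(query_terms)]

docs = ["the automobile market", "car prices rise", "weather patterns shift"]
syn = {"car": ["vehicle", "automobile"], "climate": ["environment", "weather"]}

plain = matches({"car"}, docs)
expanded = matches(expand({"car"}, syn), docs)
```

Without expansion, the query car misses the document that only says automobile; with expansion it is recalled.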

  • Embedding normalization:
    • Challenge: In vector retrieval systems, unnormalized embeddings can result in similarity scores biased by vector magnitude, not actual semantic closeness.
    • Solution: Normalize all embeddings to unit length (L2 norm) to ensure consistent cosine similarity calculations:

      import numpy as np

      def normalize_embedding(embedding: np.ndarray) -> np.ndarray:
          norm = np.linalg.norm(embedding)
          return embedding / norm if norm != 0 else embedding

    This function guarantees that all embeddings lie on a unit hypersphere, ensuring semantic similarity is judged by angular distance alone, thus improving scoring reliability across indexes.
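
The practical payoff is that, once vectors are normalized, a plain dot product is the cosine similarity, so magnitude can no longer bias the score. A quick numpy check, with arbitrary example vectors:

```python
import numpy as np

def normalize_embedding(embedding):
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm != 0 else embedding

a = np.array([3.0, 4.0])    # magnitude 5
b = np.array([30.0, 40.0])  # same direction, 10x the magnitude

# After normalization, the dot product is the cosine similarity: exactly 1
# for two vectors pointing the same way, regardless of their lengths.
sim = float(np.dot(normalize_embedding(a), normalize_embedding(b)))
```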

  • Weighted embedding fusion:
    • Challenge: In multimodal fusion, naive averaging of text and image embeddings can dilute the dominant signal of one modality.
    • Solution: Weighted embedding fusion combines embeddings using domain-specific weights:

      def weighted_embedding_fusion(text_emb: np.ndarray, image_emb: np.ndarray, text_weight: float = 0.6) -> np.ndarray:
          fused = text_weight * text_emb + (1 - text_weight) * image_emb
          return normalize_embedding(fused)

      This fusion technique allows biasing towards more reliable modalities (e.g., text in legal documents, image in e-commerce), and ensures the resulting vector is still normalized for similarity search.
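
A short check, restated standalone with arbitrary orthogonal example vectors, confirms the two properties that matter here: the fused vector stays unit length, and the more heavily weighted modality dominates the result:

```python
import numpy as np

def normalize_embedding(e):
    n = np.linalg.norm(e)
    return e / n if n != 0 else e

def weighted_embedding_fusion(text_emb, image_emb, text_weight=0.6):
    fused = text_weight * text_emb + (1 - text_weight) * image_emb
    return normalize_embedding(fused)

# Orthogonal toy embeddings standing in for a text and an image vector.
text_emb = normalize_embedding(np.array([1.0, 0.0]))
image_emb = normalize_embedding(np.array([0.0, 1.0]))

# With text_weight=0.8, the fused vector should lean toward the text axis.
fused = weighted_embedding_fusion(text_emb, image_emb, text_weight=0.8)
```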

  • Score fusion and aggregation:
    • Challenge: When retrieving from multiple indexes (e.g., text + image), combining results naïvely can lead to suboptimal rankings.
    • Solution: Use Reciprocal Rank Fusion (RRF) to aggregate results fairly based on rank-position rather than raw score:

      def score_fusion(results_a: list, results_b: list, method: str = "reciprocal_rank") -> list:
          def reciprocal_rank(score, rank):
              return 1 / (rank + 1)
          fused_scores = {}
          for rank, item in enumerate(results_a):
              fused_scores[item.id] = fused_scores.get(item.id, 0) + reciprocal_rank(item.score, rank)
          for rank, item in enumerate(results_b):
              fused_scores[item.id] = fused_scores.get(item.id, 0) + reciprocal_rank(item.score, rank)
          merged = [{"id": k, "fused_score": v} for k, v in fused_scores.items()]
          return sorted(merged, key=lambda x: x["fused_score"], reverse=True)

    This technique mitigates modality bias by ensuring that results highly ranked in either list are promoted fairly in the merged output.
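
The fusion above can be exercised with two mock ranked lists; the Hit records, ids, and scores below are invented for illustration, and the function is restated so the snippet runs standalone. Note that the raw scores never enter the fused score, only the rank positions do:

```python
from collections import namedtuple

Hit = namedtuple("Hit", ["id", "score"])

def score_fusion(results_a, results_b):
    # Reciprocal-rank contribution: rank 0 -> 1.0, rank 1 -> 0.5, ...
    def rr(rank):
        return 1 / (rank + 1)
    fused = {}
    for rank, item in enumerate(results_a):
        fused[item.id] = fused.get(item.id, 0) + rr(rank)
    for rank, item in enumerate(results_b):
        fused[item.id] = fused.get(item.id, 0) + rr(rank)
    merged = [{"id": k, "fused_score": v} for k, v in fused.items()]
    return sorted(merged, key=lambda x: x["fused_score"], reverse=True)

text_hits = [Hit("d1", 0.91), Hit("d2", 0.85)]   # mock text-index results
image_hits = [Hit("d2", 0.88), Hit("d3", 0.80)]  # mock image-index results
fused = score_fusion(text_hits, image_hits)
```

Here d2 wins (1/2 + 1/1 = 1.5) because it appears high in both lists, even though it tops neither by raw score, which is exactly the cross-list fairness RRF is after.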

  • Contextual filtering:
    • Challenge: Retrieval systems often return technically relevant but contextually inappropriate results, e.g., outdated documents or low-credibility sources.
    • Solution: Apply contextual filtering based on metadata (e.g., source type, year, region):

      from qdrant_client.http.models import Filter, FieldCondition, MatchValue

      def filter_by_metadata(source: str = None, year: int = None) -> Filter:
          conditions = []
          if source:
              conditions.append(FieldCondition(key="source", match=MatchValue(value=source)))
          if year:
              conditions.append(FieldCondition(key="year", match=MatchValue(value=year)))
          return Filter(must=conditions)

    Qdrant allows filtering at query time via such metadata. This function can be used to prioritize documents from reliable sources or within a relevant timeframe.

  • Adaptive index refresh:
    • Challenge: Indexes grow stale as documents change or embedding models improve. Without refreshing, retrieval accuracy degrades.
    • Solution: Adaptive index refresh re-embeds documents and rebuilds the vector index:

      from qdrant_client.http.models import VectorParams, Distance, PointStruct

      def refresh_index(collection_name: str, data: list, encoder, vector_size: int, qdrant_client):
          qdrant_client.recreate_collection(
              collection_name=collection_name,
              vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
          )
          points = []
          for item in data:
              text = item.get("text") or item.get("desc")
              vector = normalize_embedding(encoder.encode(text))
              points.append(PointStruct(id=item["id"], vector=vector.tolist(), payload=item["metadata"]))
          qdrant_client.upsert(collection_name=collection_name, points=points)

    This function allows periodic or event-driven reindexing, ensuring alignment between stored data, metadata, and evolving models.

  • Retrieval optimization using genetic algorithms:
    • Challenge: Selecting optimal combinations of retrieval configurations such as modality weights, query expansion strategies, filter thresholds, and reranking parameters often involves manual tuning, which is both time-consuming and suboptimal. Moreover, the retrieval landscape is non-convex and multi-objective, making traditional optimization techniques less effective.
    • Solution: Genetic algorithms (GAs) provide a population-based search method inspired by biological evolution, suitable for optimizing retrieval pipelines over multiple parameters simultaneously. A GA evolves candidate configurations through selection, crossover, and mutation, guided by a fitness function typically defined by retrieval performance metrics (e.g., normalized discounted cumulative gain (NDCG), MRR, precision@k).

GA implementation for optimizing modality

To effectively harness the potential of genetic algorithms for retrieval optimization, it is essential to translate theoretical concepts into practical, reproducible code. The following section demonstrates how to simulate the evolutionary process in a retrieval context by defining a fitness function that evaluates configuration performance, and by establishing mechanisms to represent, initialize, and manipulate candidate solutions. Through this approach, we can iteratively refine retrieval pipelines, allowing for automated discovery of superior parameter combinations. This hands-on implementation lays the groundwork for scalable, data-driven optimization, reducing manual intervention and enabling rapid experimentation in complex search environments.

import random
import numpy as np

# Sample fitness function (you'd replace this with actual retrieval evaluation)
def evaluate_config(text_weight, use_query_expansion) -> float:
    # Placeholder: simulate a fitness score based on hyperparameters
    score = 0.7 * text_weight + (0.2 if use_query_expansion else 0)
    noise = np.random.uniform(-0.05, 0.05)
    return score + noise

# Encode individuals as [text_weight, query_expansion_flag]
def initialize_population(size=10):
    return [[random.uniform(0.3, 0.9), random.choice([0, 1])] for _ in range(size)]

def mutate(individual):
    if random.random() < 0.5:
        individual[0] = min(1.0, max(0.0, individual[0] + random.uniform(-0.1, 0.1)))
    else:
        individual[1] = 1 - individual[1]  # toggle query expansion
    return individual

def crossover(p1, p2):
    return [(p1[0] + p2[0]) / 2, random.choice([p1[1], p2[1]])]

def select(pop, scores, k=4):
    return [pop[i] for i in np.argsort(scores)[-k:]]

def genetic_optimization(generations=20, pop_size=10):
    population = initialize_population(pop_size)
    best_config = None
    best_score = -np.inf
    for gen in range(generations):
        scores = [evaluate_config(*ind) for ind in population]
        gen_best = max(scores)
        # Capture the best individual before the population is replaced,
        # so the index from argmax still refers to the scored generation
        if gen_best > best_score:
            best_score = gen_best
            best_config = population[int(np.argmax(scores))]
        top_individuals = select(population, scores)
        new_population = top_individuals[:]
        while len(new_population) < pop_size:
            p1, p2 = random.sample(top_individuals, 2)
            child = mutate(crossover(p1, p2))
            new_population.append(child)
        population = new_population
        print(f"Gen {gen+1}: Best Score = {gen_best:.4f}")
    print("\nOptimal Parameters Found:")
    print(f"Text Weight: {best_config[0]:.2f}, Query Expansion: {'On' if best_config[1] else 'Off'}")
    return best_config

Explanation

This GA-based retrieval optimization method addresses the challenge of parameter interaction across retrieval stages (modality fusion, query reformulation, contextual scoring). Unlike gradient-based methods, GAs do not require a differentiable loss and can navigate discrete and continuous search spaces simultaneously. In our implementation:

  • Each individual encodes a retrieval configuration vector: [text_weight ∈ [0, 1], query_expansion_flag ∈ {0, 1}].
  • The fitness function evaluates performance using a mock retrieval score, though in practice, this would compute NDCG@k or recall@10 on a validation query set.
  • The evolutionary process over N generations converges toward the best-performing parameter combination without exhaustive search.

By integrating GAs into the retrieval pipeline, systems can self-optimize over time, adapting to domain-specific needs (e.g., placing more emphasis on image embeddings in fashion search vs. textual metadata in legal corpora).
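
Swapping the placeholder fitness for a real metric is straightforward. The sketch below computes NDCG@k from binary relevance labels (the signature and data are illustrative, not from the pipeline above); evaluate_config would then average this over a validation query set instead of returning a simulated score:

```python
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: DCG of the retrieved ranking divided
    by the DCG of an ideal ranking of the known relevant documents."""
    rels = [1 if doc_id in relevant_ids else 0 for doc_id in retrieved_ids[:k]]
    # DCG: gains discounted by log2 of (1-based rank + 1)
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    # Ideal DCG: all relevant documents placed at the top of the list
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0, pushing relevant documents down the list lowers the score, and missing them entirely yields 0, which gives the GA a smooth-enough signal to compare candidate configurations.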

Multimodal RAG system with adaptive index refresh

Setting up a multimodal RAG system from scratch involves orchestrating multiple components to enable intelligent QA across both text and image data. This guide provides a step-by-step walkthrough for building a fully functional multimodal RAG pipeline, integrating CLIP-based embedding, ChromaDB for vector storage, and LangChain for response generation. A key feature of this system is adaptive index refresh, which ensures the retrieval index remains up-to-date with evolving content or embedding models. Whether you are starting with raw files or adding new data dynamically, this setup equips your system for scalable, context-aware, and accurate multimodal search and generation.

Follow the setup instructions given in Chapter 9, Building GenAI Systems with Reranking, with minor changes.

The following figure illustrates the architecture of a multimodal RAG system that supports both text and image inputs. A user submits a query, which is encoded using either a text or image embedding model, depending on the modality. Documents and images are preprocessed and chunked into embeddings that are stored in a vector database. The index is periodically refreshed to maintain alignment with updated content. During retrieval, the query is matched against the stored embeddings, and the top results are passed to an LLM, which generates a contextually informed response that is returned to the user as the final output, as also explained in the following steps.

A flow diagram shows how text and image embedding models process queries, documents, and images, with the results stored in a multimodal vector database that is then searched, and the final results generated by an LLM.

Figure 10.1: Multimodal RAG system with adaptive index refresh

1. Refresh the index (anytime or on demand): To refresh all indexes (e.g., new files added, or model updated):

python run_refresh.py

Or use the Refresh Indexes button in the app UI.

2. Directory structure: Set up your folder as shown in the following figure:

A screenshot of the project directory structure shows folders and Python files, including app.py and a rag subfolder containing configuration, query, and generation modules, along with data storage and document folders.

Figure 10.2: Folder structure of the project

The following list is the complete end-to-end code for your multimodal RAG system with adaptive index refresh, organized into clean, modular .py files:

  • The config.py defines global constants and configuration settings used across the project. This includes directory paths for ChromaDB persistence, image and text collection names, data folders, and model names for embedding and LLM inference.

    ####rag/config.py
    CHROMA_PERSIST_DIR = "chromadb_storage"
    CHROMA_IMAGE_COLLECTION = "laptop_images"
    CHROMA_TEXT_COLLECTION = "laptop_texts"
    IMAGE_FOLDER = "data/images"
    TEXT_FOLDER = "data/documents"
    EMBED_MODEL_NAME = "clip"
    MODEL_NAME = "llama3"  # For Ollama LLM

  • The embedding_utils.py provides utility functions to generate vector embeddings for both text and image inputs using the CLIP model. These embeddings are essential for populating and querying the vector database with consistent feature representations.

    ####rag/embedding_utils.py
    from transformers import CLIPProcessor, CLIPModel
    import torch
    from PIL import Image

    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed_text_ollama(text):
        inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = clip_model.get_text_features(**inputs)
        return outputs[0].tolist()

    def embed_image_ollama(image_path):
        image = Image.open(image_path).convert("RGB")
        inputs = clip_processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = clip_model.get_image_features(**inputs)
        return outputs[0].tolist()

  • The loaders.py contains functions to load textual documents and collect valid image file paths from specified folders. It is used during both initial indexing and adaptive refresh operations to read the raw input data.

    ###rag/loaders.py
    import os

    def load_text_documents(folder):
        docs = {}
        for file in os.listdir(folder):
            if file.endswith(".txt"):
                with open(os.path.join(folder, file), "r", encoding="utf-8") as f:
                    docs[file] = f.read()
        return docs

    def load_image_paths(folder):
        return [os.path.join(folder, f) for f in os.listdir(folder) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

  • The index_builder.py implements the logic for constructing the initial indexes in ChromaDB. It deletes existing collections, processes all documents and images, generates embeddings, and stores them along with their metadata into separate text and image collections.

    ###rag/index_builder.py
    import os
    import chromadb
    from .embedding_utils import embed_text_ollama, embed_image_ollama
    from .config import *
    from .loaders import load_text_documents, load_image_paths

    def build_index():
        client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
        # Text Collection
        if CHROMA_TEXT_COLLECTION in [c.name for c in client.list_collections()]:
            client.delete_collection(name=CHROMA_TEXT_COLLECTION)
        text_collection = client.create_collection(name=CHROMA_TEXT_COLLECTION)
        texts = load_text_documents(TEXT_FOLDER)
        for idx, (fname, content) in enumerate(texts.items()):
            emb = embed_text_ollama(content)
            text_collection.add(documents=[content], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": fname}])
        # Image Collection
        if CHROMA_IMAGE_COLLECTION in [c.name for c in client.list_collections()]:
            client.delete_collection(name=CHROMA_IMAGE_COLLECTION)
        image_collection = client.create_collection(name=CHROMA_IMAGE_COLLECTION)
        images = load_image_paths(IMAGE_FOLDER)
        for idx, path in enumerate(images):
            emb = embed_image_ollama(path)
            image_collection.add(documents=[""], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": os.path.basename(path)}])

  • The refresh.py enables adaptive index refresh functionality. It deletes outdated text and image collections and rebuilds them using the latest files in the document and image folders, ensuring the system remains up-to-date with new or modified content.

    ####rag/refresh.py
    import os
    import chromadb
    from .embedding_utils import embed_text_ollama, embed_image_ollama
    from .config import *
    from .loaders import load_text_documents, load_image_paths

    def refresh_text_index():
        client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
        if CHROMA_TEXT_COLLECTION in [c.name for c in client.list_collections()]:
            client.delete_collection(CHROMA_TEXT_COLLECTION)
        collection = client.create_collection(name=CHROMA_TEXT_COLLECTION)
        texts = load_text_documents(TEXT_FOLDER)
        for idx, (fname, content) in enumerate(texts.items()):
            emb = embed_text_ollama(content)
            collection.add(documents=[content], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": fname}])
        print(f"Text index refreshed with {len(texts)} documents.")

    def refresh_image_index():
        client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
        if CHROMA_IMAGE_COLLECTION in [c.name for c in client.list_collections()]:
            client.delete_collection(CHROMA_IMAGE_COLLECTION)
        collection = client.create_collection(name=CHROMA_IMAGE_COLLECTION)
        images = load_image_paths(IMAGE_FOLDER)
        for idx, path in enumerate(images):
            emb = embed_image_ollama(path)
            collection.add(documents=[""], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": os.path.basename(path)}])
        print(f"Image index refreshed with {len(images)} images.")

    def refresh_all_indexes():
        refresh_text_index()
        refresh_image_index()

  • The reranker.py uses a cross-encoder model to re-evaluate and reorder retrieved results based on semantic similarity with the query. This reranking step improves the precision of final results by leveraging richer contextual comparisons.

    ####rag/reranker.py
    from sentence_transformers import CrossEncoder

    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def rerank(query, metadatas):
        pairs = [(query, doc.get("file", "")) for doc in metadatas]
        scores = cross_encoder.predict(pairs)
        ranked = sorted(zip(metadatas, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked]

  • The generation.py initializes and returns an Ollama-based language model for use in generative tasks. It is specifically called when the system needs to produce a synthesized natural language response to a user query.

    ###rag/generation.py
    from langchain_community.llms import Ollama
    from .config import MODEL_NAME

    def get_llm():
        return Ollama(model=MODEL_NAME, temperature=0.2)

  • The run_once.py executes a one-time full index build using the functions in index_builder.py. It is typically run when setting up the system for the first time or when a complete reindexing is required.

    ##### run_once.py

    from rag.index_builder import build_index

    if __name__ == "__main__":
        build_index()

One-time or scheduled index refresh script

The following are the different kinds of scripts:

  • The run_refresh.py triggers the adaptive index refresh process defined in refresh.py. It is designed to be executed manually or scheduled periodically to keep the indexes in sync with updated image or document content.

    #### run_refresh.py

    from rag.refresh import refresh_all_indexes

    if __name__ == "__main__":
        refresh_all_indexes()

  • Refresh button: The app.py implements the Streamlit-based user interface for the RAG assistant. It supports multiple query modes (image, text, hybrid), retrieves and reranks results, displays matched content, and includes a button to manually refresh the indexes.

    Note: You may create your own UI; the following is a sample example.

    #### app.py

    import streamlit as st

    import os

    import chromadb

    from rag.embedding_utils import embed_text_ollama, embed_image_ollama

    from rag.reranker import rerank

    from rag.config import *

    from rag.generation import get_llm

    from rag.refresh import refresh_all_indexes

    client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)

    st.title("Multimodal RAG Laptop Assistant")

    mode = st.radio("Choose Mode", ["Image → Specs", "Image + Text → Specs", "Text → Image + Specs", "Text → Generated Answer"])

    if st.button("Refresh Indexes"):
        refresh_all_indexes()
        st.success("Indexes refreshed successfully!")

    # ... (same content as the Chapter_8 and 9 `app.py` query handling logic here) ...

    This setup integrates adaptive index refresh into your existing multimodal RAG pipeline.

Enhancing multimodal RAG with adaptive refresh

This comprehensive approach enhances the capabilities of your multimodal RAG pipeline by integrating cutting-edge methods for both representation and retrieval. By combining adaptive index refresh, multi-vector embeddings, and a unified vector database, the system is able to handle a wide range of input modalities and query types with efficiency and precision. Through careful orchestration of text and image embeddings, as well as sophisticated reranking techniques, the architecture serves as a robust foundation for building advanced AI assistants. The following sections further detail the underlying pipeline, storage architecture, and retrieval mechanisms that power this system.

The end-to-end code can be found in the GitHub repository of the book. Please refer to the multi-vector representation concept listed in Chapter 6, Two and Multi-stage GenAI Systems. This system integrates dense and multi-vector text embeddings along with image embeddings into a unified vector database using Qdrant. It supports multimodal retrieval and token-level reranking and leverages an adaptive embedding refresh mechanism to ensure data consistency. The architecture exemplifies a practical implementation of a hybrid RAG pipeline with late interaction, multimodal context, and local LLM reasoning.

The following figure illustrates a robust RAG pipeline designed to efficiently process and respond to user queries using both text and image inputs. By leveraging dedicated embedding models for each modality and storing them in a unified vector database, the system supports hybrid semantic search and retrieval. Periodic index refreshes ensure that newly ingested documents and images are reflected in the database. Retrieved results undergo multi-vector-based reranking before being passed to an LLM for final answer generation, enabling accurate and context-aware multimodal responses.

A flow diagram shows queries processed by text and image embedding models, which index documents and images into a vector database; the system searches the vectors, ranks the results, and generates a response using an LLM.

Figure 10.3: Multimodal RAG with reranking flow

Vector embedding pipeline and storage in Qdrant

The system generates and stores three types of vector representations for each paired text and image document:

  • Dense text embedding (dense_text): The textual content is first embedded using the BAAI/bge-small-en model, implemented via Sentence Transformers. This produces a single 384-dimensional vector representing the overall meaning of the text. These dense vectors are used for the initial stage of retrieval, where speed is prioritized.
  • ColBERT multi-vector embedding (colbert_text): In addition to dense vectors, the system generates token-level embeddings using the colbert-ir/colbertv2.0 model through LateInteractionTextEmbedding. Unlike traditional approaches that collapse an entire document into a single vector, this multi-vector approach retains a vector for each significant token or phrase. These vectors are stored in Qdrant using a multi-vector configuration (MultiVectorConfig with MAX_SIM comparator) that enables late interaction reranking. Each document is therefore represented by a set of 128-dimensional token vectors.
  • Image embedding (image): The associated image file is embedded using the CLIP model (openai/clip-vit-base-patch32), which converts the visual content into a 512-dimensional vector. This allows for similarity-based image retrieval.
  • Insertion into a unified collection: All three types of embeddings, dense text, multi-vector text, and image, are inserted into a single Qdrant collection under separate vector fields. Each record is uniquely identified using a Universal Unique Identifier (UUID) and contains metadata such as the filename and raw text.
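The unified collection described above can be declared in a single call. The following is a sketch only, assuming qdrant-client 1.10+; the collection name "laptops" and the in-memory client are illustrative, while the vector sizes mirror the models named in the bullets:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # illustrative; the real system uses a persistent instance
client.create_collection(
    collection_name="laptops",
    vectors_config={
        # 384-d sentence vector from BAAI/bge-small-en: fast first-stage retrieval
        "dense_text": models.VectorParams(size=384, distance=models.Distance.COSINE),
        # 128-d ColBERT token vectors: late-interaction reranking via MAX_SIM
        "colbert_text": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            hnsw_config=models.HnswConfigDiff(m=0),  # m=0 disables HNSW for this field
        ),
        # 512-d CLIP vector for image similarity search
        "image": models.VectorParams(size=512, distance=models.Distance.COSINE),
    },
)
```

Declaring all three fields in one collection is what lets a single point (one UUID plus metadata) carry its dense, token-level, and image representations together.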

This design results in a unified vector store that supports multimodal retrieval (text and image) and multi-vector reranking (token-level precision).

Two-stage retrieval and multi-vector reranking

The retrieval process is divided into two stages to balance speed and accuracy:

  • Stage 1, dense retrieval: The initial retrieval is performed using the dense_text vectors. The query is embedded into a single dense vector using the same BAAI/bge-small-en model, and a fast similarity search is executed using the HNSW index in Qdrant.

prefetch = models.Prefetch(query=dense_query, using="dense_text")

  • Stage 2, multi-vector reranking: In the reranking stage, the same query is embedded at the token-level using the ColBERT model. This token-level embedding is compared with the stored colbert_text multi-vectors using the MAX_SIM operator, which selects the best matching document tokens for each query token. This enables fine-grained reranking of the initial candidate set.

    query=colbert_query,

    using="colbert_text",

  • Optional, image-based retrieval: If an image vector is provided, a separate similarity search is executed using the image vector field. This enables retrieval based on visual similarity alone or in combination with text.
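The MAX_SIM comparison at the heart of stage 2 is easy to state in plain Python: for every query token vector, keep the best dot product against the document's token vectors, then sum. A minimal sketch with toy 2-D vectors (real ColBERT vectors are 128-dimensional):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def max_sim(query_tokens, doc_tokens):
    # Late interaction: best-matching document token per query token, summed.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]   # two query token vectors
doc_a = [[0.9, 0.1], [0.2, 0.8]]   # close match for both query tokens
doc_b = [[0.1, 0.1], [0.2, 0.1]]   # poor match

print(max_sim(query, doc_a) > max_sim(query, doc_b))  # doc_a ranks higher
```

Because each query token is matched independently, a document scores well if it covers every part of the query, which is exactly the fine-grained behavior a single pooled vector cannot express.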

Context assembly and language generation

Once the top-ranked documents are selected, their textual content is extracted and concatenated to form a context string. This context is passed, along with the original query, to a local LLM (Mistral via Ollama) using a ReAct-style prompt in LangChain:

response = chain.run({"query": query_text, "context": context})

The LLM synthesizes the context and returns a natural language response.
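The assembly step itself is plain string work. The following sketch uses a hypothetical template; the actual ReAct-style LangChain prompt lives in the book's repository:

```python
def build_prompt(query, docs):
    # Join the top-ranked documents into one context string.
    context = "\n\n".join(d["text"] for d in docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

docs = [{"text": "Laptop X ships with 16 GB RAM."},
        {"text": "Laptop X weighs 1.2 kg."}]
prompt = build_prompt("How much RAM does Laptop X have?", docs)
```

The resulting string plays the role of the context passed to the chain above.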

Adaptive embedding refresh mechanism

The system includes an adaptive refresh function that scans a specified text and images folder. It detects valid .txt and .jpg file pairs, generates all necessary embeddings, and upserts them into Qdrant.

This process is adaptive in the following ways:

  • It reflects the current contents of the data folders.
  • It automatically generates embeddings for new files.
  • It avoids reprocessing if a required image file is missing.
  • However, it currently does not deduplicate or overwrite based on file names. Each insert uses a new UUID.
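The pairing behavior in the list above can be sketched as follows; the flat folder layout with matching .txt/.jpg stems is an assumption about the data directory, and the function name is illustrative:

```python
import os

def find_pairs(folder):
    # Return (txt_path, jpg_path) for every stem that has BOTH files;
    # stems whose image half is missing are skipped rather than processed.
    stems = sorted(os.path.splitext(f)[0] for f in os.listdir(folder) if f.endswith(".txt"))
    pairs = []
    for stem in stems:
        txt = os.path.join(folder, stem + ".txt")
        jpg = os.path.join(folder, stem + ".jpg")
        if os.path.exists(jpg):
            pairs.append((txt, jpg))
    return pairs
```

Deduplication could be layered on top by deriving the point ID from the file name (for example, a UUID5 of the stem) instead of generating a fresh UUID per insert.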

This refresh mechanism ensures that Qdrant stays up-to-date with the latest dataset, making it suitable for environments where documents change regularly (e.g., weekly updates).

Indexing behavior

The collection is configured with different indexing strategies per vector type:

  • dense_text: HNSW indexing is enabled for fast retrieval.
  • colbert_text: HNSW indexing is disabled to support precise reranking using MAX_SIM.
  • image: HNSW indexing is enabled for similarity-based image search.

This configuration is optimal for a two-stage retrieval system that relies on fast preselection and accurate late interaction reranking.

To do

After completing the adaptive index refresh integration, you are encouraged to extend this system by implementing additional retrieval optimization techniques. Start by incorporating query expansion to improve recall using synonyms or paraphrasing. Add modality-based routing to dynamically direct queries to the appropriate index based on input type. Implement embedding normalization before similarity comparisons and experiment with weighted embedding fusion to balance multimodal inputs. Integrate score fusion and aggregation for combining results from multiple sources. Finally, enhance contextual filtering using metadata such as timestamps or reliability. These additions will significantly improve the relevance, robustness, and adaptability of your system.

Conclusion

This chapter provided a comprehensive overview of retrieval optimization techniques, addressing the fundamental drawbacks of traditional and modern retrieval systems. We explored how targeted strategies, such as modality-based routing, query expansion, score fusion, and adaptive index refresh, mitigate these limitations. Through detailed design principles and modular Python implementations, we demonstrated how to implement adaptive index refresh. A fully functional codebase featuring ChromaDB, CLIP embeddings, and a Streamlit interface was presented, culminating in an adaptive indexing pipeline. Readers are now equipped with both conceptual understanding and practical tools to extend this framework with additional optimization techniques for real-world applications. In the next chapter, we will implement multimodal GenAI systems with voice as input.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 11 Building Multimodal GenAI Systems with Voice as Input

Introduction

This chapter explores adding speech as a primary input mode to multimodal generative AI (GenAI) systems. Traditionally reliant on text or visual input, such systems are increasingly embracing voice to enhance accessibility, natural interaction, and user engagement. The illustrated pipeline introduces a seamless flow where user queries—via keyboard or voice—are routed through a retrieval-augmented generation (RAG) chatbot. Voice input undergoes a speech-to-text (STT) transformation before integration. The system then checks a vector database for relevant context. If found, the context is passed to a Mistral large language model (LLM) for answer generation. If not, the pipeline dynamically falls back on a web search to provide sufficient grounding for response synthesis. Finally, generated answers are optionally converted to speech, closing the multimodal loop with a voice-based output. This architecture highlights the growing sophistication of GenAI interfaces, unifying speech, text, retrieval, and generation into a robust, user-centric interaction model.

Structure

In this chapter, we will learn about the following topics:

  • RAG beyond image and text RAG
  • Concepts
  • Integrating speech interfaces into RAG architecture
  • Code implementation of the voice-enabled RAG system

Objectives

The objective of this chapter is to design and implement a voice-enabled multimodal RAG system that integrates speech, document retrieval, web search fallback, and local LLM-based response generation. By enabling both STT and text-to-speech (TTS) capabilities, the system aims to create a more natural, accessible, and context-aware conversational interface. The solution combines modular LangChain components, LangGraph orchestration, Ollama-hosted LLMs, and a Streamlit UI to deliver grounded responses from local Portable Document Format (PDF) files or the web. This chapter demonstrates how speech can serve as a primary input modality in advanced RAG architectures, enhancing usability across real-world, multimodal applications.

RAG beyond image and text RAG

RAG has emerged as a powerful paradigm for grounding LLMs in external knowledge. While early implementations of RAG systems predominantly operated in the textual domain and subsequently evolved to incorporate visual modalities such as images, recent advancements call for an expansion toward a broader spectrum of modalities. A truly multimodal RAG system integrates diverse data types, including but not limited to audio (speech), video, sensor data, tabular inputs, and structured knowledge graphs, enabling richer and more contextually grounded generation across domains.

In this broader multimodal RAG framework, queries can originate from various input channels: speech (converted to text), gesture (interpreted via pose estimation), spatial data (via light detection and ranging (LiDAR) or Internet of Things (IoT) sensors), or user interactions in real-time environments (e.g., augmented reality and virtual reality (AR/VR) settings). The retrieval mechanism must therefore operate over heterogeneous index structures—embedding databases, graph databases, or structured warehouses, each representing a different modality-specific embedding space. This requires either modality-aware retrievers or cross-modal alignment techniques to ensure semantically coherent retrieval.

The generation module, typically powered by a foundation model (e.g., Mistral, GPT, or Gemini), then integrates these retrieved contexts, potentially across modalities, using fusion techniques such as attention-weighted late fusion, embedding concatenation, or contextual scoring. Such architectures enable applications ranging from multimodal conversational agents and intelligent tutoring systems to autonomous agents in physical environments. Thus, expanding RAG beyond image-text fusion unlocks new frontiers for grounding LLMs in complex, real-world information ecosystems.

At query time, the user query is encoded into a vector and compared against the stored document embeddings using approximate nearest neighbor (ANN) search, retrieving the top-k most similar candidates. These vector search results are then forwarded to a cross-encoder reranker, which jointly processes the original query and each candidate document to compute fine-grained similarity scores via full token-level interaction. The reranker reorders the results based on semantic relevance, producing a more accurate set of top-k reranked documents.
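The two-stage trade-off can be miniaturized with toy vectors: a cheap dot-product pass selects the top-k candidates, and a more expensive scorer reorders them. Everything here is illustrative; the word-overlap lambda is a stub standing in for a real cross-encoder:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_stage(query_vec, query_text, corpus, k=2):
    # Stage 1: fast bi-encoder retrieval by vector similarity.
    candidates = sorted(corpus, key=lambda d: dot(query_vec, d["vec"]), reverse=True)[:k]
    # Stage 2: precise rerank; word overlap stands in for the cross-encoder.
    score = lambda d: len(set(query_text.split()) & set(d["text"].split()))
    return sorted(candidates, key=score, reverse=True)

corpus = [
    {"text": "thin light laptop", "vec": [0.9, 0.1]},
    {"text": "gaming laptop rgb", "vec": [0.8, 0.2]},
    {"text": "desk lamp",         "vec": [0.1, 0.9]},
]
top = two_stage([1.0, 0.0], "light laptop", corpus)
```

Stage 1 never scores the pruned "desk lamp" with the expensive function, which is the source of the scalability benefit; stage 2 then corrects the ordering among survivors.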

These reranked documents, along with the original user query, are passed into the LLM for synthesis. The LLM generates the final answer, which is returned to the user. This two-stage design balances scalability (via bi-encoder retrieval) with precision (via cross-encoder reranking), resulting in both efficient and high-quality response generation.

Concepts

STT and TTS technologies serve as foundational components in the development of voice-enabled multimodal AI systems. By enabling natural language interaction through spoken input and auditory output, these technologies significantly enhance accessibility, hands-free operation, and user engagement, especially in environments where visual or tactile input may be constrained.

Let us understand the core components that power voice-enabled conversational systems STT and TTS. These technologies form the bidirectional bridge between human speech and machine intelligence, enabling natural, intuitive interactions. STT transcribes spoken input into machine-readable text, acting as the auditory gateway to downstream AI models. TTS, in turn, gives voice to the system’s responses, synthesizing human-like speech from generated text. Together, they enable seamless, end-to-end voice interaction in modern conversational AI pipelines. Details are as follows:

  • STT: STT systems leverage automatic speech recognition (ASR) models to convert spoken language into machine-readable text. Modern ASR models are built upon deep neural networks, often incorporating attention-based encoder-decoder architectures or conformer models, allowing them to capture temporal dependencies and accommodate various accents, speech rates, and acoustic environments. STT acts as the first-stage enabler in voice-based query systems, allowing natural spoken language to be routed into downstream components such as RAG-based LLMs.
  • TTS: It transforms the model-generated textual response into natural-sounding speech using neural vocoders (e.g., WaveNet, HiFi-GAN) or end-to-end transformer-based models (e.g., FastSpeech). TTS systems aim to optimize intelligibility, prosody, and emotional expressiveness, ensuring output speech feels human-like and contextually appropriate.

Together, STT and TTS form a closed feedback loop, converting user speech into actionable machine input and delivering synthesized voice output, thereby completing the auditory interface cycle in conversational AI.
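The closed loop reduces to three calls in sequence. The stubs below are placeholders, not the real SpeechRecognition, RAG, or pyttsx3 calls; they exist only to show the ordering of one conversational turn:

```python
def stt(audio):
    # Stub: a real system would run an ASR model over the audio buffer.
    return audio["transcript"]

def rag_answer(text):
    # Stub: retrieval plus LLM generation would happen here.
    return "Answer to: " + text

def tts(text):
    # Stub: a real system would synthesize a waveform from the text.
    return {"spoken": text}

def voice_turn(audio):
    # One conversational turn: STT -> RAG -> TTS.
    return tts(rag_answer(stt(audio)))

result = voice_turn({"transcript": "what is RAG?"})
```

Keeping the three stages behind simple function boundaries like this is also what makes it easy to swap engines (for example, a different ASR model) without touching the retrieval logic.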

Integrating speech interfaces into RAG architecture

The integration of STT and TTS technologies into RAG pipelines extends the capabilities of generative systems beyond traditional text-based interfaces, enabling more natural and multimodal human-computer interaction. This voice augmentation is particularly impactful in applications such as virtual assistants, accessibility systems, and embodied AI agents operating in real-world environments.

In a voice-enabled RAG pipeline, STT modules serve as the entry point, transcribing spoken user input into structured text. This transcribed text is then routed through the core RAG pipeline, where it is used to perform semantic retrieval against a vector database or other knowledge source. Retrieved documents are concatenated with the input query and passed to an LLM, such as Mistral, GPT, or Llama, which generates a contextually grounded response.

Following response generation, TTS systems are employed at the output stage to synthesize natural speech from the textual output of the LLM. This completes the voice-based interaction loop, delivering conversational responses in a human-auditory format.

This bidirectional speech integration not only enhances user experience but also introduces challenges in latency, streaming inference, and real-time error correction. Addressing these requires careful orchestration of asynchronous input/output (I/O), fast STT/TTS inference engines, and fallback mechanisms for low-confidence speech recognition or generation outputs.

The following figure illustrates a multimodal voice-enabled RAG pipeline designed to handle both keyboard and voice inputs in an intelligent question answering (QA) system:

A flow diagram shows how inputs (text, voice, or documents) are embedded into a vector database and searched. If no relevant context is found, a web search is triggered; otherwise, the LLM generates the answer for the user.

Figure 11.1: Voice-enabled multimodal RAG pipeline integrating speech

The process begins with user input, which can be entered either through a keyboard or captured via voice. For spoken queries, the system first performs STT conversion, transcribing the spoken words into textual form. Regardless of input modality, the question routing module ensures that all inputs are normalized and sent downstream in a unified format.

Next, the query is handled by the RAG chatbot, which performs a vector database lookup to check whether relevant contextual knowledge is already embedded and retrievable. If context is found, it is passed directly to the Mistral LLM, which uses this context to generate a grounded response. If no relevant context is located, the system defaults to web search-based retrieval, ensuring the LLM still receives sufficient grounding information.
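The context-or-web-search branch is a single conditional. Here is a sketch with stub components; in the real system, the vector store, Tavily, and Mistral are wired into these three parameters:

```python
def answer(query, vector_lookup, web_search, llm):
    # Prefer locally indexed context; fall back to live web search when empty.
    context = vector_lookup(query)
    source = "vector-db"
    if not context:
        context = web_search(query)
        source = "web"
    return llm(query, context), source

vector_lookup = lambda q: []                       # stub: nothing indexed yet
web_search = lambda q: ["web result about " + q]   # stub: fallback source
llm = lambda q, ctx: "grounded on: " + ctx[0]      # stub: generation

response, source = answer("weather in SF", vector_lookup, web_search, llm)
```

Because the fallback only fires when retrieval returns nothing, indexed documents always take precedence over potentially noisier web results.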

The generated answer, produced by the Mistral LLM, can then optionally be converted back into audio using TTS synthesis, providing a spoken output that aligns with the original input modality. This closed-loop pipeline exemplifies how modern multimodal systems can integrate retrieval, generation, and speech technologies to deliver intuitive, accessible conversational AI experiences.

Code implementation of the voice-enabled RAG system

Figure 11.2 presents the high-level project directory structure for a voice-enabled multimodal RAG chatbot system. The architecture modularizes key functionalities such as language modeling, vector retrieval, prompt engineering, and voice processing. This organization supports extensibility and clear separation of concerns, ranging from document ingestion and embedding to real-time speech interaction and frontend deployment.

A directory tree for a project named rag_chatbot_with_speach, showing folders and Python files for features such as embeddings, loaders, and voice, along with a data folder, a .env file, and a requirements.txt file.

Figure 11.2: Voice-enabled multimodal RAG bot structure

Tech stack overview

The system leverages a carefully curated technology stack designed to support modular, local-first, and speech-enabled RAG workflows, as outlined in the following table:

| Component | Description |
| --- | --- |
| LangChain | Serves as the backbone for RAG, enabling prompt templating, document loading, and LLM orchestration in a modular pipeline. |
| LangGraph | Provides a graph-based execution model to manage conditional flows (e.g., fallback to web search) and dynamic routing of query paths. Ideal for managing complex query states. |
| Ollama | Hosts local LLMs such as Mistral or Llama, enabling fast, offline inference without external application programming interface (API) calls. Supports custom model integration and GPU acceleration. |
| Streamlit | Powers the web-based frontend UI, enabling users to interact with the chatbot via a clean, reactive interface. Supports real-time voice and text inputs. |
| Tavily API | Acts as a live web search fallback when no relevant context is found in the local vector database, ensuring responses remain grounded in up-to-date external knowledge. |
| Nomic Embeddings | Used to convert ingested documents into high-dimensional vector representations suitable for similarity search in the vector database. |
| pyttsx3 | Enables TTS conversion on the client side, generating audible responses from LLM outputs in a fully offline, platform-agnostic manner. |
| SpeechRecognition | Captures and transcribes voice input into text using local microphone streams, acting as the system's STT engine. |

Table 11.1: Key components of the voice-enabled multimodal RAG chatbot

This integrated stack supports an end-to-end multimodal conversational AI pipeline that is capable of local inference, dynamic retrieval, real-time speech interaction, and fallback augmentation.

Frontend

The system features a minimalist Streamlit-based frontend that enables users to interact with the multimodal RAG chatbot using either keyboard input or real-time voice queries. The interface displays transcribed speech, dynamically retrieves relevant context, and presents grounded answers with source attribution.

The frontend interface, as shown in the following figure, is a multimodal RAG chatbot that is implemented using Streamlit, providing users with a web-based user interface (UI) to interact with the system via either keyboard or voice. The script (app.py) integrates various backend modules and coordinates user input, retrieval, generation, and speech functionalities in a modular, real-time workflow.

Screenshot of the chatbot interface titled "Multimodal RAG Chatbot (PDF + Web + Voice)": the user asks about the weather in San Francisco, and the chatbot replies with a cloudy forecast and the temperature.

Figure 11.3: Voice-enabled chatbot UI

To understand the core execution flow of the voice-enabled multimodal RAG chatbot, consider how this Streamlit-based application integrates local LLM inference, speech processing, and dynamic document retrieval. From environment setup and module imports to real-time interaction via keyboard or microphone, the pipeline orchestrates LLM invocation, graph-based reasoning, and speech synthesis to deliver a seamless user experience. Here, we break down the major steps that enable multimodal interaction in this system:

1. Environment setup and imports: The script begins by resolving the module path and importing dependencies:

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), ".")))

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from rag.ollama_llm import get_llm

from rag.graph_workflow import graph

from rag.voice import listen_from_microphone, speak_text

This setup ensures access to internal RAG modules and encapsulated logic for LLM invocation (get_llm()), graph-based reasoning (graph.invoke()), and speech interaction.

2. Page initialization: The Streamlit page is initialized with a title and layout:

st.set_page_config(page_title="RAG Chatbot with Voice", layout="wide")

st.title("Multimodal RAG Chatbot (PDF + Web + Voice)")

3. Input mode selection: A radio button allows users to choose between Keyboard and Voice input:

input_mode = st.radio("Choose input method:", ["Keyboard", "Voice"], horizontal=True)

4. Keyboard interaction flow: For text input, the query is submitted via a st.text_input() field and processed upon clicking the "Ask" button:

query = st.text_input("Type your question:")

if query.strip() and st.button("Ask"):

with st.spinner("Thinking..."):

state = graph.invoke({...})

The graph.invoke() function controls the RAG pipeline, retrieving documents and web content if necessary. The prompt is constructed dynamically using retrieved context:

prompt = f"""{prefix}

You are a helpful assistant. Use ONLY the information in the CONTEXT below...

"""

The response is generated by the local Ollama-hosted LLM and returned to the user via st.markdown(...), while also being converted to speech:

response = llm.invoke([HumanMessage(content=prompt)])

speak_text(final_answer)

5. Voice interaction flow: In the "Voice" mode, clicking "Speak your question" triggers real-time microphone capture:

query = listen_from_microphone()

The captured voice is transcribed via the SpeechRecognition module. The transcribed query follows the same logic path as text: passing through retrieval, prompt assembly, generation, and final response rendering. Again, speak_text(final_answer) ensures audio output:

st.success(f"You said: {query}")

final_answer = response.content.strip()

speak_text(final_answer)

Exception handling is incorporated to report runtime errors:

except Exception as e:

st.error(f"Voice error: {e}")

This frontend orchestrates a complete multimodal loop, accepting voice/text, performing retrieval with LangGraph, invoking LLMs via LangChain, and returning responses both visually and auditorily. The modularity and clarity of design make it well-suited for real-time RAG interactions with multimodal capabilities.

Main voice-enabled pipeline

To understand the internal workings of the multimodal RAG chatbot, it is essential to explore how the system is modularized across its core Python components. Each script in the rag/ directory is responsible for a specific function, ranging from document ingestion and vector indexing to prompt construction, query routing, and LLM inference. The following explanation builds an understanding of these modules in a logical execution order, highlighting how they collaborate to enable end-to-end RAG with voice and web search capabilities. The system begins by loading PDF documents (loaders.py) and transforming them into vector embeddings (embeddings.py), which are stored in a vector database (vectorstore.py). When a user query arrives, the system first tries to retrieve relevant documents locally. If no context is found, it queries the web using Tavily (tavily_search.py). The router.py module decides between these two sources. Context is then formatted into a structured prompt (prompts.py) and passed to a local LLM using Ollama (ollama_llm.py). The execution flow is managed by graph_workflow.py using LangGraph, while utils.py supports formatting and preprocessing throughout the pipeline.
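The module hand-offs described above can be sketched as a single function, with each rag/ module reduced to a hypothetical stand-in callable (the lambdas below are illustrative only, not the book's implementations):

```python
# Simplified sketch of the rag/ pipeline flow; every callable is a stand-in.
def answer(query, route, local_retrieve, web_search, build_prompt, llm):
    source = route(query)                      # router.py decides 'pdf' or 'web'
    docs = local_retrieve(query) if source == "pdf" else web_search(query)
    return llm(build_prompt(docs, query))      # prompts.py + ollama_llm.py

result = answer(
    "What is LangGraph?",
    route=lambda q: "pdf",
    local_retrieve=lambda q: ["LangGraph builds LLM state machines."],
    web_search=lambda q: ["(live web result)"],
    build_prompt=lambda docs, q: f"CONTEXT: {docs[0]}\nQUESTION: {q}",
    llm=lambda prompt: prompt.splitlines()[0],
)
print(result)  # → CONTEXT: LangGraph builds LLM state machines.
```

Because every stage is injected as a parameter, each real module (retriever, router, prompt builder, LLM) can be swapped or tested in isolation.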

This section delineates the foundational components and operational logic underlying the voice-enabled multimodal RAG framework. It systematically explores the document preprocessing pipeline, embedding strategies, vector indexing mechanisms, LLM-based query routing, and graph-driven control flow, thereby illustrating a cohesive architecture for grounded, speech-integrated information retrieval.

PDF loading and chunking with the load_pdfs() function works as follows:

  • The load_pdfs() function is designed to automate the ingestion and preprocessing of PDF documents for downstream embedding and retrieval in a RAG pipeline. It performs two critical tasks: document loading and text chunking.

    from langchain_community.document_loaders import PyPDFLoader

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    import os

    • Function overview:

      def load_pdfs(folder_path):

      documents = []

      splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

      • Initializes an empty documents list to collect all processed chunks.
      • Instantiates a RecursiveCharacterTextSplitter with a chunk_size of 1000 characters and a chunk_overlap of 200 characters. This overlap ensures semantic continuity between adjacent chunks, which is beneficial for context-aware retrieval.
    • File traversal and loading:

      for filename in os.listdir(folder_path):

      if filename.endswith(".pdf"):

      loader = PyPDFLoader(os.path.join(folder_path, filename))

      docs = loader.load()

      • Iterates through all files in the given folder_path.
      • For each .pdf file, creates a PyPDFLoader instance from LangChain’s document loader module.
      • The load() method extracts raw text content, typically separated by page.
    • Splitting and aggregation:

      documents.extend(splitter.split_documents(docs))

      return documents

      • Each loaded document is split into overlapping text chunks using the RecursiveCharacterTextSplitter.
      • These chunks are appended to the cumulative documents list and returned as a list of structured text segments ready for embedding.
    • Role in the pipeline: This function is typically called during the indexing phase, where documents from the data/documents/ folder are preprocessed into retrieval-ready units. These units are then passed to an embedding model (e.g., in embeddings.py) and stored in a vector database (e.g., in vectorstore.py).
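The chunk_size/chunk_overlap behavior can be illustrated with a plain character splitter (a toy stand-in, not the full RecursiveCharacterTextSplitter, which additionally prefers paragraph and sentence boundaries):

```python
# Fixed-size chunking with overlap: each chunk starts (chunk_size - chunk_overlap)
# characters after the previous one, so adjacent chunks share chunk_overlap chars.
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    step = chunk_size - chunk_overlap  # advance 800 characters per chunk
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("x" * 2600)
print(len(chunks))     # → 3
print(len(chunks[1]))  # → 1000
# chunks[0][-200:] == chunks[1][:200]: adjacent chunks overlap by 200 characters
```

The 200-character overlap is what preserves semantic continuity across chunk boundaries for context-aware retrieval.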
  • Embedding initialization: The get_embeddings() function encapsulates the instantiation of a locally hosted embedding model, specifically from the LangChain–Nomic integration, for use in document vectorization within a RAG pipeline.

    from langchain_nomic.embeddings import NomicEmbeddings

    This import statement loads the NomicEmbeddings wrapper, which provides a standard LangChain-compatible interface to the nomic-embed-text-v1.5 model, a high-performance embedding model optimized for semantic search and document retrieval tasks.

    • Function definition:

      def get_embeddings():

      return NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local")

      • model="nomic-embed-text-v1.5" specifies the embedding model version to be used. This model supports dense vector representations for textual data and is tuned for information retrieval tasks.
      • inference_mode="local" indicates that the embedding computation will run on the local machine rather than through remote APIs. This enables faster, offline embedding and eliminates dependency on external services.
    • Role in the RAG pipeline: This function is typically called in the document indexing or query vectorization stage (often inside vectorstore.py). It returns an embedding object that can be passed into LangChain’s vector store abstraction for both storing and searching against document chunks.

      By encapsulating the embedding logic in get_embeddings(), the system ensures a plug-and-play architecture, facilitating easy model replacement or configuration changes without modifying downstream code.
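At query time, the vectors produced by the embedding model are compared by similarity. The following toy example shows the cosine-similarity computation that underlies this comparison (the two-dimensional vectors are illustrative only; real nomic-embed-text-v1.5 outputs are high-dimensional):

```python
import math

# Cosine similarity: 1.0 for vectors pointing the same way, 0.0 for orthogonal ones.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```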

  • Vector index construction: The create_vectorstore() function is responsible for transforming embedded document chunks into a searchable vector index using LangChain’s SKLearnVectorStore, a lightweight in-memory vector store built on top of scikit-learn’s nearest neighbor algorithms.

    from langchain_community.vectorstores import SKLearnVectorStore

    This import brings in the SKLearnVectorStore implementation from LangChain’s community module. It is particularly well-suited for local, prototyping environments where persistent or large-scale vector storage (e.g., Qdrant, Faiss) is not required.

    • Function definition:

      def create_vectorstore(docs, embedding):

      return SKLearnVectorStore.from_documents(docs, embedding)

      • docs: A list of document chunks, typically preprocessed via a splitter (e.g., RecursiveCharacterTextSplitter) and returned from the loader pipeline.
      • embedding: An embedding model object, such as the one returned by get_embeddings() (e.g., using Nomic Embeddings), which converts each text chunk into a high-dimensional vector representation.

The from_documents() method performs two operations:

  • Embeds all the document chunks using the specified embedding function.
  • Constructs an internal index using scikit-learn’s nearest neighbor search (e.g., NearestNeighbors with cosine or Euclidean distance).
  • Role in the RAG pipeline: This function is typically invoked during the indexing phase, where loaded and split documents are converted into vector form and indexed for retrieval. At inference time, this vector store enables semantic similarity search between user queries and the indexed documents.

    Due to its in-memory nature, SKLearnVectorStore is ideal for development and testing but may not scale for production use where persistent or distributed indexing (e.g., via Qdrant or Pinecone) is needed.
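The similarity search itself amounts to a nearest-neighbor ranking like the brute-force sketch below (toy two-dimensional embeddings; SKLearnVectorStore delegates the real work to scikit-learn's indexed search):

```python
import math

def top_k(query_vec, index, k=2):
    """Return the k document ids whose vectors are most similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    # Rank every indexed chunk by cosine similarity to the query vector.
    ranked = sorted(index.items(), key=lambda item: cos(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = {"chunk_a": [0.9, 0.1], "chunk_b": [0.1, 0.9], "chunk_c": [0.7, 0.3]}
print(top_k([1.0, 0.0], index))  # → ['chunk_a', 'chunk_c']
```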

  • Web retrieval fallback: The search_tavily() function interfaces with the Tavily Search API to retrieve live, up-to-date content from the internet when the local vector database fails to return relevant context. It acts as a web search fallback module within a RAG pipeline, ensuring robustness in cases where indexed documents are insufficient or outdated.

    import os

    import requests

    from langchain.schema import Document

    from dotenv import load_dotenv

    This function imports environment variables, network request tools, and the Document schema from LangChain to maintain compatibility with the rest of the RAG architecture.

    • Environment configuration and API setup:

      load_dotenv()

      api_key = os.getenv("TAVILY_API_KEY")

      • Loads the Tavily API key from a local .env file using python-dotenv.
      • Ensures secure credential management outside of the codebase.
    • Web search request and error handling:

      url = "https://api.tavily.com/search"

      headers = {"Authorization": f"Bearer {api_key}"}

      payload = {"query": query, "num_results": max_results}

      • Constructs a POST request to the Tavily API with the search query and an optional max_results parameter (default: 3).
      • Uses a bearer token for secure authentication.

      Robust exception handling ensures graceful degradation in case of network errors or unexpected responses:

      response = requests.post(url, headers=headers, json=payload)

      response.raise_for_status()

    • Response parsing and output conversion:

      data = response.json()

      results = data.get("results", [])

      return [Document(page_content=entry["content"]) for entry in results if "content" in entry]

      • Extracts the response JSON and formats each result into a LangChain Document object.
      • If no results are found, a fallback message is returned:

        return [Document(page_content="Tavily returned no results.")]

    • Role in the RAG pipeline: This function is typically invoked within the query router or fallback logic, often in router.py or graph_workflow.py. When the vector search fails (e.g., low relevance or empty results), search_tavily() fetches fresh web content that can be incorporated into the prompt fed to the language model.

      This enables live knowledge augmentation, especially critical for time-sensitive, fact-based queries not covered by static document corpora.
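The parsing step at the end of search_tavily() can be exercised without any network access against a hand-written, Tavily-style payload (the sample dict below is illustrative, not a real API response):

```python
def parse_results(data):
    # Keep only entries that carry a "content" field; fall back to a stub message.
    results = data.get("results", [])
    docs = [entry["content"] for entry in results if "content" in entry]
    return docs or ["Tavily returned no results."]

sample = {"results": [{"content": "SF weather: cloudy"}, {"url": "entry without content"}]}
print(parse_results(sample))           # → ['SF weather: cloudy']
print(parse_results({"results": []}))  # → ['Tavily returned no results.']
```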

  • Query routing logic-route_question_and_get_source(): The route_question_and_get_source() function plays a pivotal role in adaptive query handling within a multimodal RAG system. Its purpose is to dynamically decide whether a given user query should be answered using locally embedded documents ("pdf") or by performing a live web search fallback ("web"), based on the judgment of a language model.

    from langchain.schema import HumanMessage, SystemMessage

    from rag.ollama_llm import get_llm

    from rag.prompts import router_instructions

    from rag.utils import safe_json_parse

    This setup imports LangChain-compatible message schemas, an LLM interface, pre-defined routing instructions as system prompts, and a utility function for robust JSON parsing.

    • Function definition and LLM-oriented decision making:

      def route_question_and_get_source(question: str) -> str:

      • Accepts a natural language question as input.
      • Returns a source decision: "pdf" (vectorstore-based retrieval) or "web" (Tavily API fallback).
    • Prompt construction and LLM invocation:

      messages = [

      SystemMessage(content=router_instructions),

      HumanMessage(content=question)

      ]

      response = llm_json_mode.invoke(messages)

      The function sends a two-turn conversation to the LLM:

      • A system message loaded from router_instructions, which guides the LLM on how to make the routing decision.
      • The human message containing the user's actual query.

      This approach leverages LLM-driven control flow, where the model returns a structured JSON output indicating the preferred data source.

    • Safe JSON parsing and decision logic:

      result = safe_json_parse(response.content)

      datasource = result.get("datasource", "vectorstore").lower()

      • The LLM's response is parsed using a fault-tolerant utility safe_json_parse() to handle unstructured or malformed outputs gracefully.
      • If the LLM suggests "websearch", the function returns "web"; otherwise, it defaults to "pdf".

        return "web" if datasource == "websearch" else "pdf"

        In the event of a failure (e.g., parsing error or unexpected content), the system defaults to using the local document vectorstore:

        except Exception as e:

        print(f"[ROUTER] JSON parsing failed: {e}")

        return "pdf"

    • Role in the RAG pipeline: This routing function is typically invoked prior to retrieval, acting as a gating mechanism to:
      • Use vector-based retrieval for questions likely to be answered from the indexed corpus.
      • Use web search fallback (e.g., Tavily) for time-sensitive, out-of-domain, or general knowledge queries.

      This design introduces dynamic adaptability in response sourcing, ensuring higher response relevance without manual rule-based routing.
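The safe_json_parse() utility is referenced but not shown here; a plausible sketch (an assumption, not the book's actual implementation) extracts the first {...} span from possibly chatty LLM output before decoding it:

```python
import json
import re

def safe_json_parse(text):
    # Grab the first {...} span so surrounding chatter does not break json.loads.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM output")
    return json.loads(match.group(0))

print(safe_json_parse('Sure! Here is my decision: { "datasource": "websearch" }'))
# → {'datasource': 'websearch'}
```

A tolerant parser like this is what lets the router degrade gracefully to the "pdf" default when the LLM emits malformed output.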

  • Prompt engineering for control and accuracy in RAG systems: Prompt engineering is critical in steering LLMs within RAG systems. The two prompts that follow, router_instructions and rag_prompt, serve specific, tightly scoped functions: query routing and grounded response generation, respectively.
    • Routing prompt:

      router_instructions = """

      You are a router deciding whether a question should be answered using the user's private PDFs or from a web search.

      If the question is about LangChain, prompt engineering, or other topics covered in the provided documents → use 'vectorstore'.

      If the question is about recent events, weather, people, locations, or real-world data → use 'websearch'.

      Return ONLY a JSON like:

      { "datasource": "websearch" }

      or

      { "datasource": "vectorstore" }

      """

    • Purpose:
      • This prompt is used to control the decision-making process of an LLM acting as a router agent.
      • It classifies incoming queries into two categories: vectorstore for PDF-based document retrieval, and websearch for fallback to real-time online search via Tavily.
    • Design considerations:
      • Binary classification using constrained output ("datasource": "...") ensures predictable downstream logic.
      • It leverages domain-specific cues (e.g., LangChain, prompt engineering) vs. open-domain, real-world questions (e.g., weather, recent events).
      • This method allows the RAG system to dynamically adapt its retrieval strategy using LLM-based control flow, reducing hard-coded rules.
    • Answer generation prompt:

      rag_prompt = """

      You are a helpful assistant. Use ONLY the information in the CONTEXT below to answer the QUESTION.

      If the CONTEXT does not contain the answer, respond exactly with:

      "I don’t know based on the context."

      ---

      CONTEXT:

      {context}

      ---

      QUESTION:

      {question}

      ---

      INSTRUCTIONS:

      - Do NOT use prior knowledge.

      - Do NOT make up any answers.

      - ONLY use information in the context above.

      - Answer in 2–3 concise sentences.

      - Say “I don’t know based on the context” if unsure.

      ---

      Answer:

      """

    • Purpose:
      • This prompt instructs the LLM to generate an answer strictly grounded in the retrieved context from vector search or web search.
      • It enforces hallucination prevention by disallowing use of prior knowledge or speculation.
    • Design features:
      • Clear output boundaries: Separators (---) structure the context, question, and rules.
      • Fallback safety net: Explicit instruction to reply with I don’t know based on the context mitigates the risk of unsupported claims.
      • Conciseness: Limit responses to two to three sentences for brevity and focus.
    • Role in the pipeline:
      • The router_instructions prompt is used early in the pipeline, prior to retrieval, to guide data source selection.
      • The rag_prompt is used after retrieval, during the LLM response generation phase, to ensure that answers remain contextually accurate and verifiable.

      Together, these prompts ensure both intelligent routing and grounded generation, two foundational components of trustworthy RAG systems.
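Before generation, the retrieved context and the user question are substituted into the rag_prompt template with ordinary string formatting; a shortened stand-in template illustrates the mechanics (the real template carries the full instruction list shown above):

```python
# Truncated stand-in for rag_prompt, filled via str.format as done before the LLM call.
template = """CONTEXT:
{context}
---
QUESTION:
{question}
---
Answer:"""

prompt = template.format(
    context="LangGraph is a framework for LLM-integrated state machines.",
    question="What is LangGraph?",
)
print(prompt.splitlines()[1])  # → LangGraph is a framework for LLM-integrated state machines.
```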

  • LLM Initialization: The get_llm() function provides a standardized interface to instantiate a locally hosted LLM using LangChain's ChatOllama wrapper. This enables model invocation without reliance on external APIs, supporting offline and private deployments of RAG systems.

    from langchain_community.chat_models import ChatOllama

    The import references the community-supported ChatOllama integration, which connects LangChain with models served via the Ollama runtime, a popular framework for running lightweight LLMs locally on CPU or GPU.

    • Function definition:

      def get_llm():

      return ChatOllama(model="mistral")

      • ChatOllama(model="mistral") launches or connects to the Mistral model, a performant open-weight LLM optimized for fast inference and strong reasoning capabilities.
      • Returns a LangChain-compatible chat model object that supports .invoke() and .stream() methods for interactive completion tasks.
    • Role in the RAG pipeline: This function is used throughout the pipeline wherever LLM-based reasoning is required:
      • Routing decisions: Interprets user queries to decide between vectorstore and web fallback (router.py).
      • Answer generation: Constructs responses using retrieved document context (graph_workflow.py).
      • Prompt templating: Accepts structured prompts from prompts.py and produces context-grounded answers.

      By wrapping the model instantiation inside get_llm(), the design follows dependency injection best practices, allowing easy substitution of models (e.g., switching from Mistral to Llama 2) without changing the core logic.

  • Graph-based RAG workflow with LangGraph: The graph_workflow.py module defines the logic and control flow of a RAG system using LangGraph, a framework for constructing LLM-integrated state machines. The system supports hybrid retrieval (PDFs + web), summarization, prompt generation, and quality control through a structured, state-driven execution plan.
    • Core components and initial setup:

      llm = get_llm()

      docs = load_pdfs("data/documents")

      retriever = create_vectorstore(docs, get_embeddings()).as_retriever(search_kwargs={"k": min(3, len(docs))})

      • Loads local PDF documents, embeds them using Nomic Embeddings, and builds a retriever using SKLearnVectorStore.
      • Instantiates the local LLM (Mistral) via Ollama.
    • State schema:

      class GraphState(TypedDict):
          question: str
          generation: str
          web_search: str
          max_retries: int
          answers: int
          loop_step: Annotated[int, operator.add]
          documents: List[Document]

      The following list defines the state variables passed between graph nodes:

      • question: The user query
      • documents: Retrieved docs
      • web_search: Flag indicating data source
      • generation: Final or intermediate answer
      • max_retries: Upper bound on generation retries
      • answers: Count of answers produced so far
      • loop_step: Iteration count for retry logic
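The loop_step field is annotated as Annotated[int, operator.add], which tells LangGraph to combine a node's partial update with the existing value via the reducer instead of overwriting it. A dependency-free emulation of that merge behavior (a sketch of the idea, not LangGraph's actual implementation):

```python
import operator

# Emulates how a reducer-annotated state key is merged: reduced keys
# are combined with the old value; all other keys are overwritten.
def merge_state(old: dict, update: dict, reducers: dict) -> dict:
    merged = dict(old)
    for key, value in update.items():
        if key in reducers and key in old:
            merged[key] = reducers[key](old[key], value)
        else:
            merged[key] = value
    return merged

state = {"loop_step": 2, "generation": "draft"}
state = merge_state(state, {"loop_step": 1, "generation": "final"},
                    {"loop_step": operator.add})
print(state)  # {'loop_step': 3, 'generation': 'final'}
```

This is why nodes can simply return {"loop_step": 1} to increment the retry counter.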
    • Graph nodes (functional units):

      retrieve(state)

      def retrieve(state):
          return {
              "documents": retriever.invoke(state["question"]),
              "web_search": "No"
          }

      • Retrieves top-k relevant documents from the local vector store based on semantic similarity.

        web_search(state)

        def web_search(state):
            docs = search_tavily(state["question"])
            return {
                "documents": docs,
                "web_search": "Yes"
            }

      • Fetches real-time web results via the Tavily API and wraps them as Document objects.

        generate(state)

        def generate(state):
            # summarize if from web
            # format context into rag_prompt
            # call LLM to get answer

      • If the results come from the web, condenses them with a summarization prompt before final generation.
      • Formats context + question into the structured RAG prompt.
      • Produces the final answer via llm.invoke().
    • route_question(state): Calls the LLM-based router (route_question_and_get_source) to decide whether to use "websearch" or "retrieve".
    • grade_documents(state): Pass-through node that maintains document state. Placeholder for future document scoring.
    • decide_to_generate(state): Logic to determine whether to generate directly or switch to web search:

      return "generate" if state["web_search"] == "No" else "websearch"

  • grade_generation_v_documents_and_question(state): Currently hard-coded to "useful", but reserved for evaluating LLM output quality against context and question.
    • Graph construction:

      workflow = StateGraph(GraphState)
      workflow.add_node("retrieve", retrieve)
      workflow.add_node("generate", generate)
      workflow.add_node("websearch", web_search)
      workflow.add_node("grade_documents", grade_documents)

      These calls define the nodes and how data flows between them. set_conditional_entry_point() lets the graph dynamically choose its starting node (retrieve or websearch) based on LLM routing.

    • Conditional edges drive the logic:
      • After document retrieval: grade the documents, then decide whether to generate or fall back to the web.
      • After generation: grade the output, then either end, retry, or fall back.

        workflow.add_conditional_edges("generate", grade_generation_v_documents_and_question, {
            "useful": END,
            "not useful": "websearch",
            "not supported": "generate",
            "max retries": END
        })
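To make the edge semantics concrete, the verdict-to-node mapping can be sketched without any LangGraph dependency (END here is a stand-in for LangGraph's terminal sentinel; the verdict strings mirror the map above):

```python
# Dependency-free sketch of the verdict-to-node map used by
# add_conditional_edges: the grader's verdict selects the next node.
END = "__end__"
EDGE_MAP = {
    "useful": END,                # answer accepted, stop
    "not useful": "websearch",    # fall back to the web
    "not supported": "generate",  # retry generation
    "max retries": END,           # give up gracefully
}

def next_node(verdict: str) -> str:
    return EDGE_MAP[verdict]

print(next_node("not useful"))  # websearch
```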

    • Final graph compilation:

      return workflow.compile()

    • The graph is compiled into an executable object and exposed as:

      graph = build_graph()

    • Role in the system: This module integrates all prior components:
      • Document ingestion
      • Embedding and retrieval
      • Query routing
      • Web search fallback
      • Prompt construction
      • LLM invocation
      • Answer quality gating

It ensures dynamic, context-aware, and adaptable execution, supporting retries, fallback, and summarization through a structured, extensible pipeline.

Robust JSON parsing utility: The safe_json_parse() function is a fault-tolerant utility designed to extract and parse JSON-formatted content from LLM-generated text. LLMs, even when prompted to return structured data, can sometimes produce additional natural language output or malformed JSON. This utility ensures that downstream components receive clean, machine-readable JSON objects, thereby maintaining reliability in automated workflows such as query routing. The details are as follows:

  • Source code:

    import json
    import re

    def safe_json_parse(text):
        try:
            match = re.search(r'{.*?}', text.strip(), re.DOTALL)
            if match:
                return json.loads(match.group(0))
            else:
                raise ValueError("No JSON found in LLM output")
        except Exception as e:
            raise ValueError(f"JSON parsing failed: {e}\nRaw Text:\n{text}")

  • Function logic: Regex extraction:

    match = re.search(r'{.*?}', text.strip(), re.DOTALL)

    • Uses a regular expression to extract the first JSON object embedded in the LLM output.
    • The re.DOTALL flag ensures multiline matches, useful if the JSON spans multiple lines.
  • JSON decoding:

return json.loads(match.group(0))

  • If a JSON object is found, it is parsed using Python’s built-in json module.
  • The result is returned as a Python dictionary.
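One caveat worth noting before reusing this utility: the non-greedy pattern r'{.*?}' stops at the first closing brace, which is fine for the flat router output used here but truncates nested JSON. A quick demonstration:

```python
import json
import re

# Flat JSON embedded in chatter parses cleanly...
flat = 'Sure! {"datasource": "vectorstore"} Hope that helps.'
m = re.search(r'{.*?}', flat, re.DOTALL)
print(json.loads(m.group(0)))  # {'datasource': 'vectorstore'}

# ...but nested JSON is cut off at the first closing brace.
nested = '{"a": {"b": 1}}'
m = re.search(r'{.*?}', nested, re.DOTALL)
print(m.group(0))  # {"a": {"b": 1}  <- truncated; json.loads would fail
```

For the single-level routing payloads in this pipeline the limitation is harmless, but it matters if the utility is reused for richer structures.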
  • Error handling:
    • If no match is found or json.loads() raises an exception, the function raises a ValueError with both the error message and the raw input text for debugging.

      raise ValueError(f"JSON parsing failed: {e}\nRaw Text:\n{text}")

    • Role in the RAG pipeline: This utility is primarily used in modules where structured JSON output is expected from the LLM, particularly in:
      • router.py — when parsing LLM output to determine whether to route a query to "websearch" or "vectorstore".

        By acting as a defensive programming layer, safe_json_parse() mitigates the risk of downstream crashes due to malformed or noisy LLM responses, enabling a more reliable and production-ready pipeline.
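As a usage sketch, this is roughly how a router would consume a noisy LLM reply; the surrounding chatter is invented, and the "datasource" field matches the routing options described above (the parser is repeated here so the snippet runs standalone):

```python
import json
import re

def safe_json_parse(text):
    # Same extraction logic as shown above, without the error wrapping.
    match = re.search(r'{.*?}', text.strip(), re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("No JSON found in LLM output")

# A typical noisy reply: valid JSON buried in conversational text.
raw = 'Here you go:\n{"datasource": "websearch"}\nLet me know!'
decision = safe_json_parse(raw)
print(decision["datasource"])  # websearch
```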

  • One-time document indexing script: The run_once.py script serves as a one-time execution utility for ingesting, embedding, and indexing documents before real-time query processing in the RAG pipeline. It modularly invokes key functions from the rag/ package to prepare the local vectorstore used for semantic retrieval.
    • Functional walkthrough:

      from rag.loaders import load_pdfs

      from rag.embeddings import get_embeddings

      from rag.vectorstore import create_vectorstore

      These imports modularly encapsulate document loading (load_pdfs()), embedding initialization (get_embeddings()), and vector index creation (create_vectorstore()).

    • Main pipeline execution:

      def main():
          print("📚 Loading documents from data/documents...")
          docs = load_pdfs("data/documents")
          print(f"Loaded {len(docs)} documents.")

      • Loads all .pdf files from the data/documents/ directory.
      • Splits them into overlapping text chunks suitable for embedding and semantic retrieval.

        print("🧠 Creating vectorstore...")

        vectorstore = create_vectorstore(docs, get_embeddings())

      • Instantiates an embedding model (e.g., nomic-embed-text-v1.5) and computes dense vector representations of each chunk.
      • Stores the embeddings in an in-memory vector index (e.g., SKLearnVectorStore).

        print(f"Embedded {len(docs)} documents into vectorstore.")

      • Confirms the completion of the indexing process.

        if __name__ == "__main__":
            main()

      • Ensures the script executes only when run directly, not when imported.
    • Purpose and role in the system:
      • This script is not part of the live inference pipeline; rather, it prepares the knowledge base that will be queried by the chatbot at runtime.
      • It ensures that document ingestion and vectorstore construction are executed once, making it ideal for batch processing or development initialization.
    • Usage context:

      Run this script to populate your vectorstore before deploying the frontend or invoking the LangGraph workflow:

      python run_once.py
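The overlapping-chunk splitting mentioned above can be illustrated with a toy splitter; the sizes are illustrative and the real loader most likely uses a LangChain text splitter, so treat this purely as a sketch of the idea:

```python
# Toy character-level splitter: consecutive chunks share `overlap`
# characters so that no sentence is cut off without context.
def split_text(text: str, chunk_size: int, overlap: int):
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = split_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap is what lets a retrieved chunk carry enough surrounding context to be useful on its own.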

This system began with a simple premise: to augment RAG with voice as a primary input modality, making interaction more natural, accessible, and user-centric. Through careful modular design, the project evolved into a robust multimodal AI assistant, capable of ingesting local documents, intelligently routing queries, retrieving context via vector search or web fallback, and generating grounded, reliable responses using a locally hosted LLM.

We started by enabling STT input and TTS output, seamlessly integrating human voice into the RAG feedback loop. We then introduced a graph-based orchestration layer using LangGraph, allowing conditional flows such as summarizing web content, retrying queries, and gracefully handling document coverage gaps.

Each Python module was purpose-built; loaders.py for document ingestion, embeddings.py for vectorization, router.py for LLM-based source routing, and graph_workflow.py for state-driven control. A minimalist yet effective frontend was built using Streamlit, allowing users to interact via voice or text with a consistent backend execution flow.

This system not only showcases the potential of voice-enabled RAG architectures but also provides a foundation for further extension into image, video, or real-time multimodal applications. In doing so, it bridges the gap between human communication and grounded AI reasoning: efficiently, ethically, and interactively. The end-to-end code is available in the Chapter_11 code.Zip file.

Conclusion

This chapter explored the evolution of RAG systems beyond traditional image and text inputs, emphasizing the integration of voice as a core modality. We examined the conceptual and architectural foundations of a voice-enabled multimodal RAG pipeline, detailing how STT and TTS interfaces can enhance natural interaction. The system dynamically routes queries between local vector search and web-based retrieval, ensuring grounded, context-aware responses. We also dissected the full implementation—from document ingestion to LangGraph-based orchestration and frontend deployment, demonstrating how modular code design supports real-time, speech-driven AI experiences. Together, these components illustrate how voice augments RAG systems for richer, more accessible applications.

The following chapter delves into reasoning and reranking techniques, offering insights into their roles in enhancing response quality within RAG systems.

CHAPTER 12Advanced Multimodal GenAI Systems

Introduction

As generative AI (GenAI) continues to evolve, the ability to simply retrieve and generate content is no longer enough. Truly intelligent systems must be able to reason, interpret diverse modalities like text and images, and select the best response from many possible outputs. This chapter pushes the boundaries of multimodal GenAI by introducing you to chain-of-thought (CoT) prompting combined with reranking, enabling your models to think step-by-step and choose wisely.

In this chapter, you will explore how to architect systems where models do not just respond but rather deliberate. You will learn to guide models through explicit reasoning steps, integrating context from both retrieved documents and image-based information, and then apply multi-pass reranking to refine answers based on quality, relevance, or task-specific constraints.

Through hands-on implementation using LangChain, Ollama, and custom CoT templates, you will build unified multimodal flows where text and image signals converge to support robust decision-making. Topics include few-shot CoT strategies, dynamic prompt construction, and context-aware reranking, all of which culminate in the development of powerful, reasoning-augmented multimodal applications.

By the end of this chapter, you will have constructed a sophisticated GenAI system capable of performing visual question answering (QA), multimodal document analysis, and step-by-step contextual decision-making, paving the way for next-generation AI that can reason as well as it retrieves.

Structure

In this chapter, we will learn about the following topics:

  • The critical role of reasoning in generative AI systems
  • Reasoning in GenAI and their types
  • Tool-based reasoning and ReAct agents
  • About reasoning benchmarks

Objectives

This chapter is theoretical in nature, so that you can understand the core concepts of reasoning in GenAI systems. It explores why reasoning is essential for building intelligent, reliable, and explainable AI models. We examine various types of reasoning, including deductive, inductive, abductive, analogical, commonsense, causal, mathematical, spatial, temporal, tool-based, and multimodal reasoning, and explain how each contributes to improved performance and decision-making. You will also learn how modern techniques like CoT prompting and reasoning and acting (ReAct) agents enable models to reason step-by-step. This foundation will prepare you to design and implement more capable and context-aware AI systems in later chapters.

The critical role of reasoning in generative AI systems

GenAI has rapidly evolved from generating text, code, or images to supporting complex decision-making tasks. At the core of this evolution lies the integration of reasoning capabilities, the ability of a model to not just generate outputs, but to understand, plan, and explain them. As the landscape of GenAI applications expands into multimodal domains and high-stakes environments, reasoning becomes the differentiating factor that transforms a reactive model into a reliable, intelligent system.

From generation to deliberation

Most traditional GenAI models rely on surface-level pattern recognition. Given a prompt, they generate a response based on statistical likelihood. While this is effective for simple tasks (e.g., drafting an email, generating a poem), it often falls short in scenarios where:

  • Multiple valid answers exist
  • Ambiguities must be resolved
  • Logical sequences are required
  • Accuracy and accountability are paramount

Reasoning fills this gap by enabling models to think out loud, evaluating intermediate steps, simulating decisions, and justifying outcomes. This is essential when working with complex, multi-hop queries (e.g., which department has the highest budget among those where employees earn over $90,000?). Without reasoning, the model may guess or skip steps; with reasoning, it can break down the query, identify subtasks, and solve them sequentially.

Trust and explainability in AI systems

A key challenge in GenAI adoption is trust. In business, law, medicine, and education, users demand systems that are not only correct but also explain their decisions. Reasoning improves explainability by:

  • Making thought processes transparent (e.g., through CoT or ReAct-style outputs).
  • Allowing human users to audit intermediate steps.
  • Supporting validation against known rules or constraints.

For example, in legal document analysis, a GenAI model should not only summarize a contract clause but explain why a clause is considered risky, step-by-step. This level of accountability is only possible through reasoning.

Handling ambiguity and disambiguation

In real-world language and vision tasks, ambiguity is common. The same term may refer to different things based on context (Apple as a company vs. a fruit; a name in an employee vs. a department table). Reasoning enables the following:

  • Disambiguation based on schema, visual clues, or surrounding context.
  • Clarification-seeking behavior (e.g., do you mean the employee's name or the department's name?).
  • Safe defaults or probabilistic ranking when ambiguity cannot be resolved with certainty.

In multimodal GenAI, this becomes even more critical. For example, if a model is answering a question about a chart or an image, it must combine visual cues with textual intent and use logic to infer what the user likely means.

Multimodal integration requires logical composition

The true power of GenAI lies in its ability to handle multimodal inputs—text, images, documents, code, tables, and even audio. However, these modalities come with diverse structures and semantics. Reasoning is essential for:

  • Aligning modalities (e.g., linking an image caption with a specific region of interest).
  • Composing multi-step interpretations (e.g., interpreting a diagram and referencing a textual explanation).
  • Making deductions that span multiple modalities (e.g., comparing data in a chart with conditions stated in a policy document).

A model that simply embeds and retrieves multimodal data cannot go far without the capacity to reason across formats and infer missing links.

Prompt engineering and CoT reasoning

Advanced prompting strategies like CoT and ReAct are direct implementations of reasoning in GenAI. These prompts encourage the model to:

  • Decompose tasks into logical steps.
  • Solve sub-problems before combining answers.
  • Justify actions before execution.

For instance, when converting a natural language query into SQL, a CoT-enabled model can first reason about which tables and columns are relevant, then construct the query. This dramatically improves correctness and reduces hallucinations.

Moreover, few-shot CoT prompting shows that even large language models (LLMs) benefit from seeing examples of step-by-step reasoning. This mirrors human learning and reinforces the idea that reasoning is not just a technique; it is a cognitive scaffolding that improves performance.
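The few-shot CoT idea can be made concrete with a small prompt builder; the exemplar and format below are invented for illustration, not taken from any particular system:

```python
# Each exemplar pairs a question with explicit reasoning and an answer,
# so the model sees HOW to think before being asked the new question.
examples = [
    {"q": "What is 12 + 7 * 2?",
     "cot": "7 * 2 = 14, then 12 + 14 = 26.",
     "a": "26"},
]

def build_cot_prompt(question: str) -> str:
    shots = "\n\n".join(
        f"Q: {e['q']}\nReasoning: {e['cot']}\nA: {e['a']}" for e in examples)
    # End with "Reasoning:" so the model continues with its own steps.
    return f"{shots}\n\nQ: {question}\nReasoning:"

prompt = build_cot_prompt("What is 3 + 4 * 5?")
print(prompt)
```

Ending the prompt at "Reasoning:" nudges the model to produce its chain of thought before committing to an answer.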

Reranking and meta-reasoning

In practice, GenAI systems often generate multiple candidate responses. Without reasoning, choosing the best one becomes arbitrary or embedding-based. With reasoning + reranking, systems can:

  • Evaluate the answer's plausibility or factuality.
  • Score answers based on internal consistency.
  • Apply constraints (e.g., is this answer consistent with retrieved facts?).

This meta-reasoning, reasoning about generated responses, is critical for reducing hallucinations and improving reliability. It is especially important in high-stakes decision-making systems, such as AI assistants in healthcare or finance.
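A toy reranker illustrates the idea of scoring candidates against retrieved facts; real systems typically use a cross-encoder or an LLM judge rather than the substring overlap sketched here:

```python
# Score each candidate by how many retrieved facts it contains, then
# keep the best-supported one -- a crude stand-in for consistency checks.
def score(candidate: str, facts: list) -> int:
    return sum(1 for f in facts if f.lower() in candidate.lower())

facts = ["revenue grew 12%", "headcount is 340"]
candidates = [
    "Revenue grew 12% while headcount is 340.",
    "Revenue fell sharply last quarter.",
]
best = max(candidates, key=lambda c: score(c, facts))
print(best)  # Revenue grew 12% while headcount is 340.
```

Even this crude check filters out the candidate that contradicts the retrieved evidence.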

Learning generalizable strategies

Reasoning helps models generalize to novel tasks. A model trained to reason can often:

  • Apply familiar problem-solving frameworks to new questions.
  • Use analogical reasoning (this is like that other problem...).
  • Adapt to instructions or goals that have not been seen before.

Without reasoning, the model is limited to surface memorization. With reasoning, it begins to approximate problem-solving intelligence—a hallmark of true general-purpose AI.

Human-AI collaboration

Reasoning not only helps the machine; it also helps the human collaborating with it. When GenAI models explain their steps, users can:

  • Identify mistakes early.
  • Provide corrections or nudges.
  • Gain insights into complex problems.

This is particularly useful in co-pilot scenarios where AI assists a domain expert. For example, in data science, a GenAI agent that can reason through exploratory data analysis (EDA) steps helps analysts speed up discovery while staying in control.

Foundation for agentic AI

As we move toward agentic systems—AI agents that plan, act, and reflect autonomously—reasoning becomes the foundation. These agents must:

  • Form plans
  • Select tools
  • React to outcomes
  • Retry or revise based on context

Every one of these actions depends on a reasoning layer. Without it, agents are random trial-and-error engines, but with reasoning, they become adaptable, intelligent assistants.

In an era where GenAI systems are increasingly integrated into workflows, decision-making, and user interactions, reasoning is not optional; it is essential. It transforms passive generators into active problem-solvers. It brings clarity, accuracy, adaptability, and trustworthiness into AI interactions. Whether through CoT prompting, ReAct loops, or multimodal reasoning chains, this capability enables AI to handle ambiguity, plan actions, explain decisions, and collaborate meaningfully with humans.

As we build more advanced GenAI systems, reasoning is the bridge between generation and intelligence. And crossing that bridge is what unlocks the next frontier of AI.

Reasoning in GenAI and their types

GenAI systems, especially LLMs and AI agents, are increasingly being designed to think through problems rather than just produce surface-level text. Modern LLMs like GPT-4 and PaLM can mimic various reasoning patterns (from strict logic to commonsense) to draw conclusions or make decisions. However, while they excel at pattern recognition and fluent imitation, true reasoning (logically connecting information, inferring unseen facts, solving novel problems) is still a challenge. Researchers are actively enhancing LLM reasoning via techniques like CoT prompting, ReAct agents, and multimodal fusion architectures to make AI's thinking more human-like and robust. The following section gives a thorough overview of the key types of reasoning in GenAI, covering what each is, examples of it, and how current systems implement it to improve performance, decision-making, disambiguation, and robustness.

Deductive reasoning in AI

Deductive reasoning is the process of drawing specific, logically certain conclusions from general premises or rules. If the given premises are true, a deductive conclusion must also be true. For example, from all whales are mammals and Orca is a whale, a deductive system concludes Orca is a mammal. LLMs can emulate deductive logic by following if-then rules and performing step-by-step inference. In practice, CoT prompting often instils a deductive style; the model is prompted to break a problem into logical steps and derive the answer systematically. This has been effective for tasks like formal logic puzzles or arithmetic, where the solution follows inevitably from the premises. By explicitly generating intermediate steps, an LLM’s answer is more likely to be logically valid and traceable to the input facts, which improves reliability in domains like math proofs or code reasoning. Deductive reasoning contributes to robust decision-making by ensuring conclusions are consistent with given facts, reducing mistakes in tasks that demand rigorous correctness.

Implemented in GenAI: CoT prompts are a direct way to elicit deductive thinking. For instance, given a math word problem or a logical riddle, models like GPT-4 are encouraged to list premises and infer each step before finalizing the answer, much like a proof. This method significantly boosts accuracy on multi-step logic and math tasks. Some neuro-symbolic systems even combine LLMs with automated theorem provers to double-check deductive steps, blending statistical and formal reasoning for extra rigor.
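The whale syllogism can be mimicked with a few lines of forward chaining, a toy stand-in for the symbolic provers mentioned above (facts and rules are the chapter's own example encoded as tuples):

```python
# Forward chaining: repeatedly apply if-then rules to known facts
# until no new conclusions can be derived.
facts = {("whale", "Orca")}
rules = [("whale", "mammal")]  # every whale is a mammal

derived = set(facts)
changed = True
while changed:
    changed = False
    for premise, conclusion in rules:
        for category, entity in list(derived):
            if category == premise and (conclusion, entity) not in derived:
                derived.add((conclusion, entity))
                changed = True

print(("mammal", "Orca") in derived)  # True
```

The conclusion is guaranteed by the premises, which is exactly what distinguishes deduction from the probabilistic reasoning styles below.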

Inductive reasoning in AI

Inductive reasoning involves generalizing from specific instances or evidence to broader rules or conclusions. The outcome is probable rather than guaranteed; it is essentially pattern learning from examples. In human terms, if you observe that the last 10 code builds succeeded after adding a certain patch, you might inductively conclude that this patch generally fixes the build. LLMs are inherently strong inductive reasoners because of the way they are trained: they ingest millions of examples and learn to predict patterns. Few-shot learning in prompts is a prime example; an LLM is given a handful of input/output (I/O) examples (specific cases), and it infers the general pattern to apply to a new query. In-context learning in LLMs is often described as inductive reasoning, as the model abstracts a rule from the prompt examples and extends it to solve a novel instance. This contributes to generalization and adaptability. For example, if shown a couple of formatted date conversions, the model can induce the formatting rule and convert a new date without explicit programming. Inductive reasoning improves creative generation and pattern recognition, but it can also introduce uncertainty: the conclusions are plausible but not certain, so models must sometimes verify inductive guesses with additional checks.

Implemented in GenAI: LLMs implement induction largely via learning from data and few-shot prompting. Rather than a special prompting technique, induction is a natural by-product of training on vast text and adjusting to given examples. For instance, GPT-style models can infer a list sorting rule or grammatical pattern from a few demonstrations and then continue it, showcasing inductive generalization. Self-consistency techniques can augment induction by having the model generate multiple plausible answers and then choose the most common or consistent one, effectively considering several inductive hypotheses and selecting the best.
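
The self-consistency step can be approximated with a simple majority vote over sampled answers. The sample strings below are hypothetical final answers from five independent CoT runs of the same question:

```python
from collections import Counter

def self_consistent_answer(samples):
    """Pick the most common final answer among several sampled reasoning paths."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical final answers from five sampled chains of thought:
samples = ["11", "11", "12", "11", "10"]
best = self_consistent_answer(samples)
```

In practice the samples come from re-querying the model at a nonzero temperature; the vote filters out reasoning paths that drifted to an outlier answer.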

Abductive reasoning in AI

Abductive reasoning is reasoning to the best explanation, forming a plausible hypothesis from incomplete observations. It is the kind of reasoning a detective uses: if we see footprints by the window and the safe open, the best explanation is a burglary. Unlike deduction, abductive conclusions are not guaranteed to be true; they are educated guesses. LLMs can perform abductive reasoning in tasks where they must fill in gaps or infer hidden causes. For example, given a partial story, an AI might guess a character’s motive that best explains their actions. Abductive reasoning is valuable for commonsense inference and troubleshooting, where multiple explanations exist and the system must pick the most likely. In GenAI, one way to implement this is via a propose-and-verify CoT: the model first posits a hypothesis, then internally checks if it fits the evidence. Studies show that LLMs can benefit from this approach: for instance, treating a multiple-choice question as an abductive task, hypothesizing an answer, and then seeing if it makes sense in context often yields better results. Humans naturally switch to abductive reasoning when direct deduction is hard, and LLM agents are starting to mimic that flexibility. By incorporating abductive reasoning, AI systems become more robust to ambiguity, as they can handle incomplete information and still offer a reasonable solution.

Implemented in GenAI: Researchers have explored prompts that explicitly tell the LLM to think of possible explanations. For example, given a riddle or a diagnostic question, the model might be guided to enumerate potential reasons and then conclude with the most plausible one. Some agent frameworks implement abductive strategies by generating a hypothesis and using a tool (like a knowledge lookup) to verify it before finalizing the answer, a process akin to hypothesize, then test. This approach is useful in diagnosis tasks (medical or technical) where the AI suggests a cause for symptoms and then checks consistency with known facts, improving decision-making under uncertainty.
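
A propose-and-verify step can be sketched as scoring candidate hypotheses by how many observations each one explains. The hypothesis table below is invented for the burglary example; in a real agent it would come from model generations checked against a knowledge lookup.

```python
def best_explanation(observations, hypotheses):
    """Return the hypothesis that accounts for the most observations."""
    def coverage(name):
        return sum(1 for obs in observations if obs in hypotheses[name])
    return max(hypotheses, key=coverage)

observations = {"footprints by the window", "safe open"}

# Each hypothesis maps to the set of observations it would explain (illustrative).
hypotheses = {
    "burglary": {"footprints by the window", "safe open", "missing jewelry"},
    "owner forgot to lock": {"safe open"},
}
verdict = best_explanation(observations, hypotheses)
```

The verify half of propose-and-verify is the `coverage` check: a hypothesis that leaves evidence unexplained scores lower and is discarded.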

Analogical reasoning in AI

Analogical reasoning involves drawing parallels between similar situations or structures to infer a conclusion. In essence, the AI uses an analogy: if two things share some relationship, then knowledge about one can inform understanding of the other. A classic example is solving analogies like bird is to sky as fish is to ___? The model must recognize the relation (a bird lives in the sky) and work out that fish live in water. LLMs can handle simple analogies because they have seen many word relationships (synonyms, categories, etc.) during training. For instance, GPT-4 can complete knife is to cut as pen is to ___ with write by recognizing the functional analogy. Beyond word puzzles, analogical reasoning lets AI apply known solutions to new problems by recognizing structural similarity. An LLM agent might approach a new task by recalling an analogous scenario it knows, then mapping the solution over. This contributes to creative problem-solving and disambiguation. If an instruction is unclear, the AI might recall an analogous example from its prompt or memory to interpret it correctly. However, analogical reasoning can be challenging when the analogy is abstract or requires real-world experience. Current GenAI systems implement analogies mostly implicitly (through learned language patterns), but research is emerging to make this more explicit. One approach directs the model to identify the relationship in one pair and then apply it to another, thereby forcing an analogical CoT.

Implemented in GenAI: Analogical reasoning is not as commonly highlighted as other reasoning types, but it is present in tasks like metaphor understanding or Scholastic Aptitude Test (SAT)-style analogy questions. Prompting strategies can encourage analogy by asking, how is this situation similar to a known scenario? Some experimental methods give the LLM examples of analogies to follow. For example, a prompt might show: Paris is to France as Tokyo is to Japan (country-capital relationship), and then ask the model to apply that relation to a new pair. By doing so, the model explicitly searches for the analogous relationship. Encouraging analogies helps in knowledge transfer. For instance, a multimodal agent could reason that holding a pencil is analogous to holding a paintbrush to transfer motor skills, or an LLM could solve a puzzle by recalling a similar puzzle’s solution format. Recent work even trains meta-models to pick the best reasoning style (deductive vs. abductive vs. analogical) for a given problem, illustrating that adding analogical thinking can expand the range of solvable tasks.
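
The identify-then-apply pattern can be sketched with a tiny hand-made relation table; in a real system the relation would live in the model's learned representations rather than an explicit lookup, so this is purely illustrative.

```python
# Hypothetical relation table for A:B :: C:? style analogies.
relations = {
    ("Paris", "France"): "capital_of",
    ("Tokyo", "Japan"): "capital_of",
    ("knife", "cut"): "used_to",
    ("pen", "write"): "used_to",
}

def solve_analogy(a, b, c):
    """Identify the relation of the pair (a, b), then find d such that (c, d) shares it."""
    rel = relations.get((a, b))
    for (x, y), r in relations.items():
        if x == c and r == rel:
            return y
    return None
```

An analogical CoT prompt asks the model to do the same two steps in words: name the relationship first, then apply it to the new pair.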

Commonsense reasoning

Commonsense reasoning is the ability of an AI to use everyday world knowledge and the obvious logic that humans take for granted. This includes basic facts (water is wet), spatial-temporal common sense (people do not walk through walls), social norms, and cause-and-effect in typical situations. It is crucial for understanding implicit meanings and avoiding nonsensical answers. LLMs learn a great deal of common sense from their training text, but they may not always apply it reliably. For example, asked whether an elephant can fit through a doorway, a naive model might answer yes, if it squeezes, demonstrating a lack of commonsense physical reasoning. With proper techniques, generative models can reason that an elephant is too large for a standard door, so the answer should be no. One successful approach is using a CoT to inject commonsense: by walking through a scenario step-by-step, the model can be reminded of common knowledge at each step. Indeed, CoT prompting has been found to improve performance on commonsense QA tasks by letting the model articulate cause-and-effect and world knowledge before answering. For example, given that it is raining and John left his umbrella at home, what will happen when he walks outside? A CoT might explicitly note that it is raining and, without an umbrella, John will get wet, leading to the answer that John will get soaked. Commonsense reasoning greatly aids disambiguation; it helps an AI choose interpretations that make sense in context (e.g., understanding idioms, resolving pronouns by plausible intent). Modern LLMs also leverage external knowledge bases or tools for commonsense: if unsure, an agent can query a fact database (like asking if elephants fit through doors) to avoid silly mistakes. By building in commonsense, AI systems become more robust and aligned with human expectations, improving their decision-making in open-ended real-world scenarios.

Implemented in GenAI: Commonsense reasoning is often enhanced by prompt engineering and fine-tuning on specialized data. Datasets like CommonsenseQA or StrategyQA train models on everyday reasoning questions, improving their internal grasp of physical and social logic. In prompts, developers might include statements of obvious facts (Reminder: elephants are bigger than doors) to cue the model. CoT is helpful as models like GPT-4 can be prompted to explain a scenario (the cup fell off the table, so it likely broke because cups are fragile) before answering, ensuring they consider general knowledge. Another approach is retrieval-augmentation: if a question needs commonsense knowledge (e.g., do elephants fit through doors), an LLM agent can use a search tool to check typical elephant sizes or known facts. This tool-augmented reasoning mimics how humans recall facts or consult references, leading to answers that are both correct and make sense. By combining innate model knowledge with external information and explicit reasoning steps, current AI systems handle commonsense queries much better than earlier generations.
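
The retrieval-augmentation idea can be sketched with a toy fact table standing in for a real knowledge base or search API; the numbers and lookup function below are illustrative assumptions, not measured data.

```python
# Hypothetical fact store a commonsense agent might consult instead of guessing.
FACTS = {
    "elephant": {"height_m": 3.0},
    "doorway": {"height_m": 2.0},
}

def fits_through(entity, opening):
    """Decide physical fit from stored facts rather than text patterns."""
    return FACTS[entity]["height_m"] <= FACTS[opening]["height_m"]

answer = "no" if not fits_through("elephant", "doorway") else "yes"
```

The design point is the lookup: grounding the comparison in retrieved quantities prevents the pattern-matching failure where the model answers yes, if it squeezes.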

Causal reasoning

Causal reasoning is the ability to understand cause-and-effect relationships by identifying what leads to an event or predicting its outcomes. For example, an AI capable of causal reasoning can infer that a glass falling on a hard floor will shatter, or, in the reverse direction, reason that if the street is wet, it likely rained. This type of reasoning is vital for planning and prediction tasks. In GenAI, causal reasoning comes into play when models need to reason about why something happened or about what-if scenarios. LLMs can sometimes infer causal links by relying on patterns (rain leads to wet streets is common in text), but true causal inference is hard because correlation in data is not always causation. To improve this, CoT prompting can be used to have the model explicitly consider causal chains: e.g., X happened, which would cause Y, which in turn causes Z. By enumerating these links, the model can avoid logical leaps. One interesting benefit for decision-making is that an agent with causal reasoning will foresee the outcomes of its actions (useful in planning tasks or game environments). For instance, a robot-planning LLM might reason: if I knock over the vase, it will break and upset the user, so I should avoid that. This forward simulation is causal reasoning at work. It also aids disambiguation; consider a question like: the lawn is wet in the morning. What might be the cause? A causal-reasoning LLM can propose maybe it rained overnight, or the sprinkler ran, applying real-world knowledge of typical causes. Some specialized benchmarks (e.g., CLadder, CausalQA) test LLMs on cause-effect understanding, and results show that larger models with reasoning prompts can identify causal relations more often than chance. Still, purely text-based models can be fooled by surface cues, so researchers integrate causal diagrams or structured knowledge to solidify this ability. Causal reasoning ultimately contributes to an AI’s robustness by ensuring its actions and answers follow logically from causes, and it can handle what-if questions more reliably.

Implemented in GenAI: Current systems enhance causal reasoning through a mix of prompting and architecture. On the prompting side, techniques like counterfactual prompts ask the model to imagine different causes and check consistency (If X had not happened, would Y still happen?). This encourages the LLM to distinguish mere correlation from actual dependency. CoT can explicitly prompt: let us analyze the causal chain step-by-step. On the architecture side, some approaches convert text into structured forms like causal graphs and then reason over them. For example, an LLM can be guided to read a paragraph and extract events and their temporal order or causal links, forming a mini knowledge graph. It might then reason over this graph (either with an internal module or by generating a logical explanation) to answer a question or make a decision. Such a method was used to improve temporal and causal reasoning by translating text into a timeline graph and then performing reasoning with the help of CoT on that graph. Additionally, tool-augmented agents can do causal reasoning by querying cause-effect databases or running simulations. For instance, an AI might use a physics engine tool to predict outcomes of physical actions, thereby grounding its causal predictions in reality. All these implementations aim to ensure the AI not only knows that something happens, but why, thereby making its behavior more reliable and interpretable.
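
One way to make a causal chain explicit, as in the graph-extraction approach above, is to store cause-effect links as a small directed graph and enumerate downstream effects. The links below are an illustrative hand-built graph, standing in for edges an LLM would extract from text.

```python
# Hypothetical cause -> effects edges extracted from a passage.
causes = {
    "rain": ["wet street"],
    "wet street": ["slippery road"],
    "slippery road": ["car skids"],
}

def effects_of(event):
    """Collect all transitive effects of an event by walking the causal graph."""
    seen, stack = set(), [event]
    while stack:
        for effect in causes.get(stack.pop(), []):
            if effect not in seen:
                seen.add(effect)
                stack.append(effect)
    return seen
```

A what-if query then becomes a graph traversal: removing the "rain" node and re-running the walk answers "would the road still be slippery if it had not rained?" structurally rather than by pattern matching.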

Spatial reasoning

Spatial reasoning is the capacity to reason about space, geometry, and physical layouts: understanding relationships like left-right, above-below, distances, or how objects fit together. In humans, this underpins tasks from packing a suitcase to navigating a route. For AI, spatial reasoning can mean reading a textual description of a scene and determining spatial relations or looking at an image and understanding object arrangements. LLMs on their own (with only text input) often struggle with complex spatial problems described in language. For example, a text-based puzzle might say the red ball is two spots to the left of the blue ball, which is not at the leftmost position, and ask which spot the red ball is in. Without a diagram, the model must simulate a mental map. Generative models have had difficulty with such tasks because keeping track of multiple relative positions is challenging in pure language form. However, researchers developed prompting strategies to help. One effective method is Chain-of-Symbol (CoS) prompting, which has the model convert spatial descriptions into a simplified symbolic representation (like a grid or list of coordinates) before reasoning. By using symbols (e.g., abbreviations for objects and positions), the model can internally draw a mental map and then answer questions about it. This approach greatly improved accuracy on spatial tasks such as planning and navigation instructions. For instance, in one example, the model was asked about a list of items and had to figure out how many were vegetables (requiring it to identify items and count them). By representing the items as a dictionary of categories, the LLM could count vegetables and arrive at the correct answer (7) in a step-by-step manner. Spatial reasoning is crucial for multimodal agents (like robots or vision-language models (VLMs)) because they must interpret real-world layouts. It contributes to robust performance by preventing absurd outputs (an AI with spatial sense will not say the cat is inside the closed box if not possible) and allowing better planning (knowing an object’s location relative to another).

Implemented in GenAI: Spatial reasoning is implemented both through specialized prompting and multimodal model design. On the text side, as noted, CoT can incorporate symbols: e.g., the prompt can instruct, let us use coordinates, mark positions of each object, and then answer. This was shown to save tokens and boost accuracy on complex spatial puzzles. For navigation or pathfinding, LLM-based agents can output step-by-step directions by internally simulating movements on a map described in text. In the multimodal realm, VLMs (like GPT-4V or PaLM-E) inherently perform spatial reasoning by processing images. These models use fusion architectures that combine visual and textual features, allowing them to, say, look at an image of a room and answer spatial questions (is the chair to the left of the table?). Some advanced systems even allow the LLM to manipulate images as part of reasoning, for example, Visual ChatGPT or OpenAI’s Visual CoT can rotate or zoom into an image to better inspect details. This is akin to a human tilting their head to understand a scene. Such tool-assisted visual reasoning enables the AI to handle spatial tasks with greater accuracy. Overall, by integrating spatial representations (either via symbols or via visual inputs), GenAI becomes much more capable at tasks that mirror our physical world understanding.
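
The Chain-of-Symbol conversion can be sketched by mapping the ball puzzle onto numeric spot indices; the index chosen for the blue ball is one assignment consistent with the puzzle ("not at the leftmost position"), not the only one.

```python
# Translate "the red ball is two spots to the left of the blue ball" into symbols.
positions = {"blue ball": 3}                         # assumed 1-based spot index
positions["red ball"] = positions["blue ball"] - 2   # "two spots to the left"

def is_left_of(a, b):
    """Positional question answered numerically instead of verbally."""
    return positions[a] < positions[b]
```

Once the description is numeric, every follow-up question (which spot? who is leftmost?) reduces to arithmetic, which is the token-saving effect CoS prompting exploits.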

Temporal reasoning

Temporal reasoning is reasoning about time: the order of events, durations, frequencies, and temporal relationships (before/after, while, until, etc.). For AI, temporal reasoning is needed to interpret stories, schedule tasks, or understand processes. For example, an AI should infer from a narrative that Alice finished breakfast before going to work, which implies breakfast happened earlier than work. While this sounds simple, LLMs can get confused with complex time-based logic, especially when events are described out of chronological order or involve implicit time jumps. Temporal reasoning also includes understanding durations (e.g., if told John took a 2-hour nap starting at 1 PM, the AI should conclude he woke at 3 PM). In GenAI systems, robust temporal reasoning ensures consistency in stories (no character magically knowing something that has not happened yet), correct answers in questions about sequences, and proper planning for agents. Research indicates that LLMs still struggle with temporal logic and often require augmentations to handle it. For instance, one study noted that temporal reasoning tasks require a combination of skills: logical ordering, basic arithmetic (for dates or durations), and commonsense knowledge of typical timelines. To improve LLM performance, a technique has been to use an intermediate temporal representation, such as a timeline or temporal graph (TG). In a recent approach, text describing events is converted into a TG, a structured timeline, and then the LLM reasons over that graph using CoT steps. By explicitly mapping events to a timeline, the model more easily answers questions like what happened just before X? or did Y happen after Z? This method yielded more reliable reasoning steps and answers than letting the LLM free-form its internal timeline. In interactive agents, temporal reasoning allows planning over time (e.g., figuring out an order of execution: first heat the oven, then mix ingredients, because the oven needs preheating). It also helps with disambiguation: if two similar events are mentioned, understanding which came first can clarify context (as in stories or historical questions). Overall, temporal reasoning adds to an AI’s robustness and coherence, ensuring that the dimension of time is handled in a human-like way.

Implemented in GenAI: Improvements in temporal reasoning come from explicitly teaching models about time. One strategy is temporal CoT, where the prompt guides the LLM to list events in order or compute time differences step-by-step. For example, a prompt might say: Let us sort these events by when they happened, before answering a question about them. Another strategy is integrating tools: an LLM agent might call a calendar API or a date calculator to handle tricky date arithmetic (like what day of the week will it be 45 days from Tuesday?) to avoid mistakes. As mentioned, converting text to a temporal graph is like giving the model an internal timeline to consult. After building such a graph (nodes as events, edges as temporal relations), the AI can either query it with a logical module or traverse it with learned reasoning steps. Also, specialized training data can help, e.g., fine-tuning a model on stories with annotated event timelines or on math word problems about time (so it learns concepts like elapsed time). In summary, GenAI systems are increasingly addressing temporal reasoning by combining language models with structured time representations and by prompting them to think chronologically, which leads to a more accurate understanding of when things happen and in what sequence.
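
The timeline idea can be sketched by anchoring events to explicit datetimes, which turns before/after and duration questions into arithmetic. The nap example from above is used; the calendar date is an arbitrary illustrative anchor.

```python
from datetime import datetime, timedelta

# Anchor events on an explicit timeline (date chosen arbitrarily for illustration).
events = {"nap starts": datetime(2024, 1, 1, 13, 0)}          # 1 PM
events["wakes up"] = events["nap starts"] + timedelta(hours=2)  # 2-hour nap

def happened_before(a, b):
    """Order question answered by comparing timestamps, not surface text order."""
    return events[a] < events[b]
```

This is the same move a date-calculator tool makes for an agent: once times are values rather than words, "what day is 45 days from Tuesday?" style questions cannot be fumbled by pattern matching.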

Mathematical reasoning

Mathematical reasoning refers to the ability to solve mathematical problems and perform correct calculations or symbol manipulations. This ranges from basic arithmetic (what is 12 × 9?) to complex word problems or even proving theorems. Historically, pure neural language models were notorious for making arithmetic errors or failing at multi-step math problems because they tended to guess answers based on pattern recognition. However, with techniques like CoT, LLMs have shown remarkable improvements in math problem-solving. The key is that math requires deductive, stepwise reasoning, exactly what CoT prompting encourages. For example, consider a word problem: Roger has five tennis balls. He buys two cans of three tennis balls each. How many balls does he have now? If asked naively, a model might output a wrong guess, but with a CoT prompt it will reason: He had 5. Two cans of 3 each means 6 more. 5 + 6 = 11, and then conclude 11. By externalizing each step (instead of trying to do it all in the hidden layers), the model dramatically reduces errors. Mathematical reasoning is not just arithmetic; it includes algebraic reasoning (solving for X), geometric reasoning (about shapes), and even logical puzzles like Sudoku. LLMs still have limits here, especially without tools; they might falter on very large numbers or long proofs because of length and precision limits. To bolster performance and accuracy, AI systems often incorporate tool-based approaches for math. An LLM agent can call a calculator or a Python interpreter for exact computation, ensuring no simple arithmetic mistakes. This kind of tool use has been shown to essentially eliminate calculation errors while the LLM focuses on setting up the problem correctly. The synergy of the LLM’s reasoning and the tool’s precision yields both correct and explainable solutions: the model explains the reasoning in words, and the tool provides the numeric answer. Mathematical reasoning in AI leads to improved performance on benchmarks (like GSM8K, a math word problem set) and is a good indicator of an AI’s ability to handle systematic logical tasks.

Implemented in GenAI: CoT prompting is the main paradigm shift that unlocked much better mathematical reasoning in LLMs. Developers include worked examples with step-by-step solutions in the prompt or instruct the model to think step-by-step for math questions. This has enabled even models like GPT-3.5 to solve many grade-school math problems correctly, where they previously failed. For higher-level math or longer calculations, integrating external tools is common. For instance, OpenAI’s code interpreter allows ChatGPT to write and run Python code; a user can ask a complex math question, and the model will generate a small script to compute the answer, combining logical setup from its reasoning with flawless computation by the machine. In agent frameworks like ReAct, a math question might trigger the LLM to issue an action, Calculator[expression], get the result, and then continue the reasoning with that number. There are also specialized neuro-symbolic models (like AlphaCode for coding or MetaMath for theorem solving) that blend neural networks with formal math solvers. These systems treat math problems by generating hypotheses (potential solutions) and formally verifying them, much like a human might test an equation solution. In summary, mathematical reasoning is implemented through careful prompt design that encourages logical breakdown, sometimes combined with symbolic modules or tools that execute the grunt work of math, allowing the AI to achieve both correctness and clear justification in its answers.
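
A calculator tool of the kind a Calculator[expression] action would invoke can be sketched as a safe arithmetic evaluator over the expression's syntax tree. This is an illustrative stand-in, not the actual implementation behind any framework's calculator.

```python
import ast
import operator

# Supported binary operators for the toy calculator tool.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression):
    """Evaluate +, -, *, / over numbers via the AST (no arbitrary eval)."""
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)
```

The agent writes the expression (the reasoning step it is good at), and the tool guarantees the digits, which is the division of labor described above.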

Tool-based reasoning and ReAct agents

Tool-based reasoning in AI refers to an agent’s ability to use external tools or APIs (such as search engines, calculators, databases, or even other AI models) as part of its reasoning process. Instead of relying solely on its internal knowledge, the AI recognizes when a tool can help, acts to fetch information or perform an operation, and then reasons with the result. The ReAct framework is a leading paradigm that formalizes this process. In ReAct agents, the LLM does not just output an answer; it interleaves thoughts (CoT reasoning) with actions like tool calls. For example, consider a complex question: “What is the capital of the country that won the FIFA World Cup in 2018?” A tool-using agent recognizes that the question asks for the capital of the country that won in 2018. That country was France, and the capital of France is Paris. To be sure, however, it might perform actions: search for the 2018 World Cup winner (gets France), then search for the capital of France (gets Paris), and then answer Paris. During this process, the agent’s reasoning trace might look like:

Thought: I need to find the World Cup 2018 winner.

Action: search(2018 World Cup winner)

Observation: France won in 2018.

Thought: Now find the capital of France.

Action: search(capital of France)

Observation: The capital is Paris.

Thought: So the answer is Paris.

This framework improves accuracy and robustness because the model can fetch up-to-date or precise information rather than guessing (reducing hallucinations). It also helps with disambiguation, like if the question is unclear, the agent can do a quick lookup or ask a clarifying question as a tool. According to the ReAct paper, such agents showed superior performance on knowledge-intensive tasks and reduced errors that come from the model’s uncertainty. Essentially, tool-based reasoning lets AI systems overcome their training limitations. If an LLM does not know something (e.g., a very recent event or a tricky calculation), a tool call can supply that knowledge, and then the LLM’s reasoning can integrate it into the answer. This synergy mimics how humans think; we use notepads for calculation, search engines for facts, etc., resulting in more reliable and trustworthy AI outputs.

Implemented in GenAI: Modern LLM-based agents (e.g., those built with frameworks like LangChain, or with OpenAI’s function-calling API) operationalize tool use with structured prompts. A typical ReAct prompt might include examples like:

Thought: I need to know X. I will use Tool Y.

Action: Y(query)

Observation: ...

Thought: Based on that, next I will... and so on.

The agent continues this loop until it can formulate a final answer (finish[answer]). Tools can be anything: a web search (for knowledge), a calculator (for math), a translation API, a database lookup, or even image recognition in a multimodal agent. The prompt engineering ensures the LLM knows the tools available and how to format actions. Because the LLM’s CoT is explicitly connected to actions, the system can handle very complex tasks by decomposing them: the reasoning decides what needs to be done and in what order, and the acting fetches results or effects changes. This greatly improves decision-making in unfamiliar situations. For instance, an AI home assistant faced with “My internet is down, what should I do?” might not have that answer in its training, but with tool use, it can run through steps: ping a server, read a troubleshooting guide, etc., then give a solution. Multimodal agents also use tool-based reasoning: an example is an agent that can see an image and then use an OCR module as a tool to read text in the image, then reason about it. Tool-based reasoning frameworks like ReAct have been pivotal in moving AI beyond static QA to interactive problem-solving. They contribute to robustness (fewer incorrect answers since the model can verify facts) and enable continuous learning, in a sense, because the model can always fetch updated info, and it is less constrained by the fixed training data. In sum, tool-based reasoning equips GenAI with a form of augmented intelligence, combining the model’s textual reasoning with the precise capabilities of external tools to achieve far better performance on complex, real-world tasks.
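The loop just described can be sketched in a few lines of plain Python. Everything here is illustrative: the LLM is a scripted stand-in, the tool registry holds one hard-coded search function, and the Action/finish formats are parsed with simple regexes rather than a framework:

```python
import re

def react_loop(llm, tools, question, max_steps=5):
    """Minimal ReAct loop: the LLM emits 'Action: tool(arg)' steps or a
    final 'finish[answer]'; observations are appended to the transcript."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)            # model produces Thought + Action text
        transcript += step + "\n"
        done = re.search(r"finish\[(.+?)\]", step)
        if done:
            return done.group(1)          # final answer
        action = re.search(r"Action: (\w+)\((.*?)\)", step)
        if action:
            name, arg = action.groups()
            transcript += f"Observation: {tools[name](arg)}\n"
    return None

# Hypothetical stand-ins for a real LLM and a real search tool.
facts = {"2018 World Cup winner": "France", "capital of France": "Paris"}

def mock_search(query):
    return facts.get(query, "no result")

def scripted_llm(transcript):
    if "Observation: Paris" in transcript:
        return "Thought: I have the answer. finish[Paris]"
    if "Observation: France" in transcript:
        return "Thought: Now find the capital. Action: search(capital of France)"
    return "Thought: Find the 2018 winner. Action: search(2018 World Cup winner)"

answer = react_loop(scripted_llm, {"search": mock_search},
                    "What is the capital of the 2018 World Cup winner?")
```

Running this reproduces the two-hop trace from the text and returns "Paris"; swapping `scripted_llm` for a real model call and `mock_search` for a real retriever is the only change a production agent would need in outline.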

Multimodal reasoning and fusion in AI systems

While many of the above reasoning types are discussed in the context of text, multimodal agents extend reasoning across various data types, e.g., images, audio, video, and text together. In such systems, reasoning involves fusing information from multiple modalities and potentially using one modality to disambiguate or confirm information from another. For instance, consider an AI that sees an image of a messy room and is asked, “Can the vacuum reach the crumbs under the couch?” It must combine visual spatial reasoning (from the image) with physical commonsense to answer. Multimodal reasoning is enabled by architectures that perform alignment and fusion of modalities. Alignment means linking corresponding elements (e.g., matching a caption sentence to a region in an image), and fusion means jointly processing the inputs to produce a unified understanding. Models like GPT-4 Vision and Google’s PaLM-E use transformers that accept both text and visual embeddings, so the model effectively sees and reads in one combined representation. This allows it to do things like identify an object in an image and then reason about it with world knowledge. Notably, OpenAI’s recent research demonstrated models that think with images in their CoT, meaning the model can perform internal visual processing as part of step-by-step reasoning. For example, the model might internally decide to zoom into a part of an image or rotate it to read text, all as intermediate steps in solving a problem. This is essentially a multimodal ReAct: the model treats image manipulations as tools within a CoT reasoning process. The result is a significant improvement in tasks like visual QA (VQA), image-based troubleshooting, or spatial reasoning from pictures. By fusing visual and textual reasoning, these systems achieve state-of-the-art performance on benchmarks that require understanding both modalities.
For instance, an AI can read a diagram (image) and a related text paragraph together, reason about them, and answer a complex science question that needs both modalities, something neither text-only nor image-only models could easily do alone. Multimodal fusion contributes to disambiguation (the image can clarify what the text refers to and vice versa) and to robustness (the AI is less likely to hallucinate about visual details because it sees them). It also opens advanced applications: a multimodal agent can plan actions in a physical environment (vision gives it the current state, language reasoning gives it planning ability) or provide richer explanations (pointing to parts of an image while verbally reasoning).

Implemented in GenAI: At the architecture level, multimodal transformers incorporate modalities through techniques like cross-attention, where, say, text tokens attend to image feature maps. There are generally two patterns: a two-tower model encodes each modality separately, then combines at a later stage (e.g., via concatenation or a small fusion network), and a one-tower model that, from the start, processes mixed modality input in one network. The one-tower (fully fused) approach is what models like GPT-4 Vision use, essentially treating image patches like tokens alongside text tokens. This tight integration enables nuanced reasoning, like referencing a specific object in the image when generating text. On the software side, frameworks like HuggingGPT and others orchestrate multiple expert models (one for vision, one for language) in a reasoning loop; the language model decides when to call the vision model (as a tool), and then uses the result. This is a modular way to get multimodal reasoning: the LLM’s CoT includes self-questioning steps such as “I have an image; let me ask the vision module for a description, then use that description to answer the question.” Such systems have successfully handled tasks like describing an image and then answering follow-up questions about it. Visual CoT prompting is another emerging technique: the model is prompted with not just textual thinking but also to imagine or sketch out a solution. For example, to solve a puzzle about tying knots, the prompt might encourage the model to visualize the steps (some research gets models to produce a pseudo-drawing in ASCII as part of reasoning!). While still in the early stages, these approaches point towards AI that can use imagination-like processes. Finally, multimodal reasoning improves comprehensiveness by leveraging complementary strengths of each modality; the AI gets a fuller picture. Empirically, combining modalities often boosts accuracy and generalization.
Multimodal GenAI agents can therefore tackle complex tasks (like explaining a meme, which needs vision + language + cultural commonsense) that were previously out of reach, all by integrating the reasoning types discussed (spatial, causal, commonsense, etc.) within a unified multimodal framework.
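To make the cross-attention idea above concrete, here is a minimal single-head sketch in NumPy: text-token vectors act as queries that attend over image-patch features (keys and values). Learned projection matrices, multi-head structure, and masking are deliberately omitted, and random vectors stand in for real encoder outputs:

```python
import numpy as np

def cross_attention(text_tokens, image_feats, d_k):
    """Text tokens (queries) attend to image features (keys/values):
    one fusion step producing an image-informed vector per text token."""
    scores = text_tokens @ image_feats.T / np.sqrt(d_k)        # (T_text, T_img)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ image_feats                               # (T_text, d_k)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))     # 4 text tokens, embedding dim 8
image = rng.normal(size=(16, 8))   # 16 image patches, same dim
fused = cross_attention(text, image, d_k=8)   # shape (4, 8)
```

Each row of `fused` is a weighted mixture of image patches, which is exactly how a one-tower model lets a text token "look at" the relevant region of the image while generating.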

So, today’s GenAI systems intertwine these diverse reasoning types to achieve more human-like intelligence. Each type of reasoning contributes in its own way to making AI outputs more accurate, coherent, and reliable. For instance, deductive logic ensures consistency and correctness given rules, induction and abduction allow creativity and handling of uncertainty, analogies enable knowledge transfer, and strong commonsense, causal, spatial, and temporal reasoning prevent the bizarre mistakes earlier models made about the world. Mathematical reasoning and tool use greatly enhance precision and factual accuracy, addressing key weaknesses of the past. Implementations like CoT prompting have proven that prompting an LLM to think aloud can significantly improve performance across math, logic, and commonsense tasks. Agent frameworks like ReAct go a step further by letting the model act on its thoughts (e.g., browsing or calculating), which makes decision-making more grounded and less prone to hallucination. And as we embrace multimodal fusion, AI can draw on the full richness of visual and textual information, leading to robust understanding and reasoning in complex, real-world scenarios. Crucially, research has shown that no single reasoning strategy is best for all problems—each approach can uniquely solve certain challenges. Therefore, the cutting edge of AI is about combining these reasoning types. By equipping generative models with a toolbox of reasoning skills and the strategies to choose among them, we are moving closer to AI systems that can think through problems as flexibly and reliably as humans do, if not more so.

About reasoning benchmark

Reasoning benchmarks are specialized evaluation tools designed to measure how well LLMs can think through problems, make logical inferences, and arrive at correct conclusions beyond simple pattern matching. Unlike traditional natural language processing (NLP) benchmarks that focus on language fluency or factual recall, reasoning benchmarks test multi-step problem-solving, mathematical deduction, causal inference, and planning skills essential for tackling complex, real-world tasks. By providing standardized, challenging scenarios across diverse domains such as science, law, and commonsense reasoning, these benchmarks help researchers objectively assess model performance, identify weaknesses, compare systems, and track progress over time. They are critical for ensuring that LLMs are not only articulate but also genuinely capable of robust, reliable reasoning.
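As a rough sketch of how such benchmarks are typically scored, the harness below computes exact-match accuracy over (question, gold answer) pairs in the GSM8K style, taking the last number in the model's output as its final answer. The dataset and "model" here are toy stand-ins for illustration only:

```python
import re

def exact_match_accuracy(model, dataset):
    """Score a model on (question, gold_answer) pairs:
    the last number in the model's output counts as its final answer."""
    correct = 0
    for question, gold in dataset:
        output = model(question)
        numbers = re.findall(r"-?\d+\.?\d*", output)
        if numbers and numbers[-1] == gold:
            correct += 1
    return correct / len(dataset)

# Toy check with hard-coded outputs (illustrative only).
toy_dataset = [("Tom has 3 apples and buys 2 more. How many?", "5"),
               ("A bottle holds 1.5 liters. How much in 3 bottles?", "4.5")]

def toy_model(question):
    if "bottle" in question:
        return "1.5 x 3 = 4.5. Answer: 4.5"
    return "3 + 2 = 5. Answer: 5"

accuracy = exact_match_accuracy(toy_model, toy_dataset)   # 1.0 on this toy set
```

Real benchmark harnesses add answer normalization, few-shot prompt templates, and per-category breakdowns, but the core loop is this simple comparison of extracted versus gold answers.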

The following table summarizes widely recognized benchmarks used to evaluate the reasoning capabilities of LLMs, highlighting their primary purpose and areas of focus:

Benchmark | Purpose | Focus

Massive Multitask Language Understanding (MMLU), AI2 Reasoning Challenge (ARC), HellaSwag, Grade School Math 8K (GSM8K) | General reasoning, commonsense, and math. | Broad reasoning skills.

Google-proof Question and Answers (GPQA), MATH, LogiQA | Advanced reasoning in STEM and logic. | Deep domain-specific reasoning.

R-Bench, OneEval, Humanity’s Last Exam (HLE) | Multidisciplinary or structured reasoning. | Challenging cross-domain evaluation.

Advanced Reasoning Benchmark (ARB), PlanBench | Complex, specialized reasoning scenarios. | Next-level reasoning depth.

OptiLLMBench | Inference techniques impact. | Reasoning efficiency.

Apple’s puzzles | Stress-testing reasoning limits. | Model robustness evaluation.

Table 12.1: Key benchmarks for LLM reasoning evaluation

Conclusion

In this chapter, we laid the theoretical groundwork for understanding reasoning in GenAI systems. By exploring a range of reasoning types, from deductive logic to multimodal integration, we highlighted how each contributes to more intelligent, reliable, and context-aware AI behavior. We examined how reasoning enhances capabilities like disambiguation, planning, tool use, and explanation. With techniques such as CoT prompting and ReAct-style agent design, reasoning becomes a practical tool for guiding AI outputs. This foundational understanding equips you to build advanced GenAI systems that not only generate but also reason through complex, real-world tasks.

In the next chapter, we will implement two types of reasoning in GenAI systems.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 13
Advanced Multimodal GenAI Systems Implementation

Introduction

Having established a thorough theoretical foundation for reasoning in generative AI (GenAI), we now shift focus from why reasoning matters to how it can be practically implemented. In this chapter, we will understand the architectural design patterns, frameworks, and modular components required to build reasoning-augmented GenAI systems.

You will explore real-world implementations using tools like LangChain, Ollama, and Python, and learn how to combine chain of thought (CoT) prompting, reasoning and acting (ReAct) style agent workflows, and tool-augmented execution into scalable AI pipelines. Through hands-on code walkthroughs and reusable templates, you will learn how to engineer systems that retrieve, reason, act, and adapt across text, images, and structured data.

Structure

In this chapter, we will learn about the following topics:

  • Prompting techniques for reasoning in GenAI systems
  • Architecture for reasoning at the reranking stage
  • Architecture for reasoning at the recommendation stage

Objectives

This chapter aims to provide a comprehensive understanding of reasoning mechanisms within GenAI systems. It begins by exploring advanced prompting techniques that facilitate structured reasoning in language models. The chapter then delves into architectural frameworks and implementation strategies for integrating reasoning at the reranking stage, where retrieved candidates are evaluated and refined. Finally, it examines reasoning at the recommendation stage, demonstrating how multi-source data and user profiles can be synthesized to generate context-aware suggestions. Through practical examples and design principles, readers will gain insights into building intelligent, reasoning-capable AI systems for both retrieval refinement and personalized recommendations.

Prompting techniques for reasoning in GenAI systems

Before diving into code and architecture, let us take a quick look at prompting techniques for reasoning. Prompting is one of the most critical techniques for guiding and extracting reasoning capabilities from large language models (LLMs). As models grow more powerful, prompting strategies must evolve to support not just pattern recognition, but structured, explainable, and multi-step reasoning. This section begins with foundational prompting strategies like zero-shot and few-shot prompting, then progresses to advanced methods such as CoT, tree of thoughts (ToT), ReAct, and others that explicitly scaffold and enrich reasoning processes in GenAI systems.

Basic prompting techniques

This section introduces two foundational prompting strategies, zero-shot prompting and few-shot prompting, that are widely used to guide LLMs in generating accurate and context-aware responses. While these techniques offer powerful ways to elicit task-relevant behavior without model fine-tuning, a broader range of advanced prompting strategies also exists, such as CoT prompting, self-consistency, tool-augmented prompting, and contrastive prompting; these are covered later in this chapter.

Zero-shot prompting

Zero-shot prompting refers to instructing a model to perform a task without providing any prior examples in the prompt. Instead, the model relies entirely on its pretrained knowledge and the natural language instructions given.

Example:

  • Prompt: Translate the following sentence into French: I am happy.
  • Response: Je suis heureux.

When used in reasoning tasks, zero-shot prompting is often paired with process-oriented cues such as let us think step-by-step, which help elicit implicit reasoning chains. This approach has been shown to improve performance on arithmetic, logic, and commonsense problems by encouraging the model to externalize its thought process.
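A sketch of the two variants, with illustrative helper names (not from any library): the only difference between plain zero-shot and zero-shot CoT is the appended process cue.

```python
def zero_shot_prompt(instruction: str, question: str) -> str:
    """Plain zero-shot prompt: instruction plus question, no examples."""
    return f"{instruction}\n\nQ: {question}\nA:"

def zero_shot_cot_prompt(instruction: str, question: str) -> str:
    """Zero-shot CoT: the same prompt with a step-by-step cue appended,
    which tends to elicit an explicit reasoning chain from the model."""
    return zero_shot_prompt(instruction, question) + " Let's think step by step."

p = zero_shot_cot_prompt("Solve the word problem.",
                         "Tom has 3 apples. He buys 2 more. How many does he have?")
```

Both strings would be sent to the model unchanged; no examples are included, so the model relies entirely on its pretrained knowledge plus the cue.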

The benefits are as follows:

  • Requires no task-specific examples.
  • Easy to deploy across domains.
  • Useful for rapid experimentation or broad generalization.

The following are the limitations:

  • May underperform on complex tasks without guidance.
  • Sensitive to prompt phrasing.
  • Cannot demonstrate reasoning format explicitly.

Few-shot prompting

Few-shot prompting involves including a small number of input/output (I/O) examples within the prompt. These examples serve as in-context demonstrations that guide the model in understanding the task format, reasoning style, or domain expectations.

Example (reasoning task):

Q: Tom has 3 apples. He buys 2 more. How many does he have?

A: Tom starts with 3 apples. He buys 2 more. So, 3 + 2 = 5. Answer: 5

Q: A bottle holds 1.5 liters. How much in 3 bottles?

A: Each bottle holds 1.5 liters. 1.5 × 3 = 4.5. Answer: 4.5

Q: A car travels at 40 km/h for 2 hours. How far?

A: The car travels at 40 km/h for 2 hours. So, 40 × 2 = 80. Answer: 80

Few-shot prompting is specifically effective for eliciting CoT reasoning, where the model learns to articulate intermediate steps before arriving at a final answer.
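The worked examples above can be assembled into a few-shot CoT prompt mechanically. The helper below is a plain-Python sketch (names are illustrative); the exemplars are the first two Q/A pairs from the text, and the new question is appended with an open "A:" for the model to complete:

```python
# Worked Q/A exemplars demonstrating the desired CoT answer format.
EXEMPLARS = [
    ("Tom has 3 apples. He buys 2 more. How many does he have?",
     "Tom starts with 3 apples. He buys 2 more. So, 3 + 2 = 5. Answer: 5"),
    ("A bottle holds 1.5 liters. How much in 3 bottles?",
     "Each bottle holds 1.5 liters. 1.5 × 3 = 4.5. Answer: 4.5"),
]

def few_shot_prompt(question, exemplars=EXEMPLARS):
    """Assemble a few-shot CoT prompt: worked Q/A pairs, then the new
    question with an open 'A:' slot for the model to complete."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt("A car travels at 40 km/h for 2 hours. How far?")
```

Because every exemplar ends with "Answer: <number>", the model tends to imitate that structure, which also makes the final answer easy to extract programmatically.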

The benefits are as follows:

  • Significantly improves accuracy on structured tasks.
  • Encourages reasoning patterns via example imitation.
  • Does not require model fine-tuning.

The following are the limitations:

  • Prompt size is constrained by the context window.
  • Requires careful example curation.
  • Susceptible to order effects and prompt sensitivity.

Advanced prompting strategies for reasoning in GenAI systems

While zero-shot and few-shot prompting provide the foundation, advanced reasoning tasks, such as planning, tool-use, and multimodal integration, often require explicit scaffolding of reasoning, memory, or search. The following strategies represent emerging best practices for enabling deeper, more interpretable, and more robust reasoning capabilities in LLMs and GenAI agents.

This section provides an overview of advanced prompting paradigms that specifically aim to scaffold reasoning in LLMs and multimodal agents:

  • ToT prompting: ToT (Yao et al., 2023 or https://arxiv.org/abs/2305.10601) extends CoT prompting by enabling the model to explore multiple reasoning paths in a structured decision tree. Each branch represents a distinct intermediate step or idea, allowing the model to generate, evaluate, and select among divergent thoughts. This approach supports deliberation over multiple possibilities and has demonstrated improved performance in tasks requiring search-based problem-solving, planning, and creative ideation.
  • Graph-of-Thoughts (GoT): GoT generalizes ToT by allowing non-linear and cyclic reasoning structures. Instead of a strict tree, it models reasoning as a graph, enabling interlinked concepts, backtracking, and multimodal fusion. This structure is particularly useful in multi-hop question answering (QA), dialogue systems, and interactive planning, where reasoning does not follow a single linear path. GoT is often combined with memory modules or agent frameworks to support persistent and contextual reasoning across steps.
  • Self-consistency prompting: Self-consistency (Wang et al., 2022 or https://arxiv.org/abs/2203.11171) improves CoT performance by sampling multiple reasoning traces for the same input and selecting the final answer based on majority vote or probabilistic consensus. This mitigates the variability of single-pass generation and reduces the impact of incorrect reasoning paths. The approach is especially effective in domains such as arithmetic reasoning, logic puzzles, and commonsense inference, where a single incorrect step can invalidate the final output.
  • Chain-of-Symbol (CoS) prompting: CoS prompting augments CoT by converting linguistic reasoning into a structured symbolic form, such as spatial grids, tables, or key-value maps. This abstraction facilitates intermediate symbolic manipulation and has been shown to significantly improve model performance on spatial reasoning, inventory classification, and diagram-based problem-solving. By using symbols as cognitive scaffolding, models are able to organize better and operate on internal representations of complex inputs.
  • Scratchpad prompting: Scratchpad prompting instructs the model to maintain and update intermediate variables explicitly during generation. This mirrors how humans write out steps when solving math or logic problems. It is particularly effective in domains such as mathematics, code synthesis, and data transformation, where intermediate state tracking is critical for correctness. The scratchpad acts as both memory and a validation mechanism within the reasoning loop.
  • ReAct prompting: ReAct (Yao et al., 2022) is a hybrid prompting strategy that interleaves CoT reasoning with tool-use actions. In this framework, models cycle through thought | action | observation steps, enabling them to reason, interact with external tools, and update their beliefs based on observations. ReAct has been particularly impactful in the development of agentic LLMs, allowing them to perform interactive problem-solving, retrieval-augmented generation (RAG), and task completion via APIs.
  • Typed CoT/typed thinker: Typed CoT enhances standard CoT by assigning explicit reasoning types (e.g., causal, temporal, spatial, mathematical) to each step. This type of scaffolding improves reasoning diversity, modularity, and interpretability, and allows meta-models to select the most suitable reasoning type for a given problem. Recent studies show this approach increases accuracy and clarity, especially in multi-domain and open-ended reasoning tasks.
  • Generate–then–select (reranking-based prompting): This two-phase prompting approach involves the following:
    • Generating multiple reasoning paths or answers.
    • Selecting or reranking the outputs based on plausibility, consistency, or scoring heuristics.

      This approach is often used in conjunction with ToT or CoT and improves reliability in tasks where multiple outputs are plausible (e.g., open-ended questions, creative writing, fact-based QA). Reranking can be implemented via LLM self-evaluation, external critic models, or retrieval-guided validation.

  • Multi-agent prompting (debate and Socratic reasoning): In this strategy, multiple LLM agents adopt different roles (e.g., proposer, skeptic, explainer) and engage in dialogue-based reasoning. The process simulates debate, peer review, or cooperative problem-solving and leads to higher-quality, cross-validated answers. This method promotes exploratory reasoning, conflict resolution, and multi-perspective understanding, and has shown promise in ethical decision-making, policy analysis, and interactive tutoring systems.
  • Automatic-CoT (Auto-CoT generation): Auto-CoT reduces reliance on handcrafted exemplars by automatically generating CoT examples from existing QA pairs using heuristics or small models. This improves prompt scalability, supports domain adaptation, and enables zero-resource CoT fine-tuning. Auto-CoT can bootstrap reasoning abilities in new domains by systematically exposing the model to decomposed reasoning patterns.

    These advanced prompting techniques represent a significant step toward aligning GenAI systems with human-like reasoning capabilities. By incorporating structured, diverse, and tool-enhanced reasoning flows, they enable LLMs to handle complex tasks with greater reliability, transparency, and contextual awareness. As GenAI systems are increasingly integrated into real-world workflows, the use of such reasoning-centric prompting strategies will be central to their robustness and trustworthiness.
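To make the generate-then-select pattern described above concrete, the following minimal Python sketch samples several candidate answers and then selects the one with the highest consistency score. The generator and scorer here are toy stand-ins (a canned answer list and a majority-vote heuristic); in a real system they would be an LLM sampling call and an LLM self-evaluation, external critic model, or retrieval-guided validator.

```python
from collections import Counter

def generate_then_select(generate_fn, score_fn, prompt, n=5):
    """Phase 1: generate n candidates. Phase 2: select the highest-scoring one."""
    candidates = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=score_fn)

# Toy stand-ins for an LLM sampler and a plausibility scorer.
_canned = iter(["42", "41", "42", "42", "40"])
answers = []

def fake_generate(prompt):
    answer = next(_canned)
    answers.append(answer)
    return answer

def consistency_score(candidate):
    # Self-consistency heuristic: prefer the answer generated most often.
    return Counter(answers)[candidate]

best = generate_then_select(fake_generate, consistency_score, "What is 6 * 7?")
print(best)  # "42", the majority answer
```

The same two-phase skeleton works with any scoring function, which is what makes reranking-based prompting composable with ToT or CoT generation.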

Now that we have established a comprehensive conceptual understanding, let us proceed to implement two distinct scenarios:

  • Reasoning at the reranking stage: This scenario demonstrates how reasoning is applied during the reranking phase using a CoT-style reranker within a multimodal GenAI system.
  • Reasoning at the recommendation stage: In this case, reasoning is employed during the recommendation process, where insights are derived from multiple heterogeneous datasets.

Architecture for reasoning at the reranking stage

The following figure illustrates a hybrid RAG architecture that enhances result relevance through combined reranking using both cross-encoders and LLM-based CoT reasoning. Starting from user input, the system retrieves semantically similar documents using a vector database, then refines the candidate ranking by fusing shallow semantic similarity (via cross-encoders) with deep reasoning-based scoring (via CoT prompts). The top-k reranked results are finally passed to an LLM for response generation, enabling contextually rich, highly relevant outputs tailored to complex user queries.

Flowchart of the RAG process: the query is embedded, a vector search is performed, results are reranked, and the top-ranked results are combined with the query in an LLM to generate the answer.

Figure 13.1: Hybrid RAG architecture with reranking for improved relevance

The reranking here combines image similarity (1–distance) and textual relevance scored using a CoT prompt via a language model (LLM). For each candidate spec retrieved using vector similarity, the LLM is prompted to reason step-by-step about how well the specs meet the user query, then assigns a numeric score (0–1). This CoT-generated score is blended with the image similarity score using a weighted average (α = 0.5). The candidate with the highest combined score is selected. Thus, CoT enables deeper semantic understanding beyond vector similarity, improving reranking with interpretability and better alignment to user intent.

Let us understand the code in depth. This section provides a systematic explanation of a modular GenAI pipeline implemented in Python, designed for multimodal query handling, specifically matching user queries with laptop specifications using both image and text modalities. The system is built upon key components such as ChromaDB for vector storage, CLIP for image and text embeddings, and LangGraph for agentic execution with CoT-based reranking.

The following directory structure represents the implementation of a multimodal RAG system that incorporates dual-stage reranking, leveraging both cross-encoder scoring and LLM-based reasoning. Designed to support hybrid retrieval with image and text modalities, the architecture integrates core components for embedding, indexing, reranking, and orchestration via LangGraph agents. This modular layout ensures flexibility for experimenting with different reranking strategies and multimodal retrieval workflows.

Screenshot of the code directory in a terminal, showing folders and Python files including chroma_storage, retrievers, rag, and cross_encoder_reranker.py, with the note: "New module for reranking."

Figure 13.2: Folder structure of reasoning at the reranking stage

Module: loaders.py

This module encapsulates I/O utility functions for loading raw data from disk:

  • load_text_documents(folder): This function scans a specified folder, identifies all .txt files, and loads their content into a dictionary:

    def load_text_documents(folder):
        docs = {}
        for file in os.listdir(folder):
            if file.endswith(".txt"):
                with open(os.path.join(folder, file), "r", encoding="utf-8") as f:
                    docs[file] = f.read()
        return docs

    This function ensures that textual specifications for laptops are appropriately read and prepared for embedding.


  • load_image_paths(folder): This function identifies image files (with .jpg, .jpeg, .png extensions) from the specified directory and returns their absolute paths:

    def load_image_paths(folder):
        return [os.path.join(folder, f) for f in os.listdir(folder) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

    This is crucial for preparing image paths for embedding and indexing.

Module: embedding_utils.py

This module provides access to OpenAI’s CLIP model to embed both text and images into a shared vector space. The model and processor are loaded once globally for computational efficiency:

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

embed_text_ollama(text)

Processes and encodes a given text string into a 512-dimensional embedding using CLIP:

def embed_text_ollama(text):
    inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = clip_model.get_text_features(**inputs)
    return outputs[0].tolist()

embed_image_ollama(image_path)

Encodes an image (loaded from disk) into a 512-dimensional embedding vector:

def embed_image_ollama(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = clip_model.get_image_features(**inputs)
    return outputs[0].tolist()

These embeddings allow for semantic comparisons across modalities.
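Because CLIP places text and images in a shared vector space, cross-modal comparison reduces to vector similarity. The following self-contained sketch illustrates the idea with cosine similarity over toy 4-dimensional vectors standing in for the 512-dimensional outputs of embed_text_ollama and embed_image_ollama (the vectors and product names are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (e.g., 512-dim CLIP outputs)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim stand-ins for real CLIP embeddings.
query_vec = [0.1, 0.9, 0.2, 0.4]       # text query embedding
laptop_img = [0.12, 0.88, 0.25, 0.35]  # image of a matching laptop
chair_img = [0.9, 0.05, 0.8, 0.1]      # image of an unrelated product

# The query lands closest to the semantically related image.
closest = max([("laptop", laptop_img), ("chair", chair_img)],
              key=lambda item: cosine_similarity(query_vec, item[1]))[0]
print(closest)  # "laptop"
```

In the actual pipeline, ChromaDB performs this nearest-neighbour comparison internally over the stored embeddings.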

Module: index_builder.py

This script builds the vector index using ChromaDB. It performs the following steps:

1. Instantiate Chroma client: To begin the indexing process, we first establish a connection to the persistent Chroma client:

client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)

2. Create or reset collections: Text and image collections are (re)initialized to avoid stale data:

if CHROMA_TEXT_COLLECTION in [c.name for c in client.list_collections()]:
    client.delete_collection(name=CHROMA_TEXT_COLLECTION)

3. Index text data: Documents loaded via load_text_documents() are embedded and added to the Chroma text collection:

text_collection.add(documents=[content], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": fname}])

4. Index image data: Similarly, image paths are loaded and embedded, with only metadata (filename) used for reference:

image_collection.add(documents=[""], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": os.path.basename(path)}])

This module ensures that all resources are embedded and stored for downstream retrieval.

Module: reranker.py

This module uses a cross-encoder model for reranking based on semantic similarity between the query and retrieved metadata filenames.

  • rerank(query, metadatas): Constructs pairwise comparisons between the query and metadata file names and reranks using CrossEncoder:

    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def rerank(query, metadatas):
        pairs = [(query, doc.get("file", "")) for doc in metadatas]
        scores = cross_encoder.predict(pairs)
        ranked = sorted(zip(metadatas, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked]

    Although not used in langgraph_agent.py, it can complement or replace LLM-based reranking depending on the application.

Module: langgraph_agent.py

This module defines a structured agent using LangGraph to execute a multi-step retrieval and reranking process using CoT reasoning. The workflow follows three primary stages: embedding, reranking, and reading.
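The embed, rerank, and read flow can be illustrated with a dependency-free sketch in which each node receives and mutates a shared state dictionary, mimicking how LangGraph threads state through the graph. The candidate files and distances below are hypothetical placeholders, not the module's actual retrieval results:

```python
# Minimal sketch of a stateful Embed -> Rerank -> Read pipeline.
def node_embed(state):
    # Stand-in for vector retrieval: (file, distance) pairs.
    state["candidates"] = [("spec_a.txt", 0.2), ("spec_b.txt", 0.4)]
    return state

def node_rerank(state):
    # image_score = 1 - distance, so smaller distance ranks higher.
    scored = [(f, 1 - d) for f, d in state["candidates"]]
    state["best"] = max(scored, key=lambda x: x[1])[0]
    return state

def node_read(state):
    state["answer"] = f"Selected spec: {state['best']}"
    return state

def run_pipeline(query):
    state = {"input": query}
    for node in (node_embed, node_rerank, node_read):
        state = node(state)
    return state

final = run_pipeline("lightweight laptop for travel")
print(final["answer"])  # Selected spec: spec_a.txt
```

LangGraph adds to this skeleton explicit graph construction, entry/finish points, and the ability to branch conditionally between nodes.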

Agentic characteristics of the langgraph_agent.py module

The langgraph_agent.py module exemplifies an agentic system architecture grounded in the principles of structured decision-making, modular planning, and multimodal reasoning. Leveraging the LangGraph framework, the system implements a stateful pipeline to process user queries, retrieve semantically similar candidates, and rerank them using a hybrid scoring mechanism.

The agent’s design incorporates several advanced capabilities that collectively enable intelligent, multimodal, and context-aware decision-making. The key architectural features include the following:

  • Stateful execution and modular design: At the core of the agent is a LangGraph-based state machine, defined using a StateGraph that orchestrates node transitions across a linear pipeline comprising embed, rerank, and read stages. Each node transforms and propagates a shared mutable state dictionary, supporting agent memory and enabling sequential task execution. This aligns with agentic paradigms wherein decisions are conditioned on the evolving internal state of the system.
  • Reasoning via CoT prompting: A salient feature of the agent is the use of CoT prompting to guide its reranking behavior. Specifically, the function llm_score() invokes a language model to reason step-by-step about the alignment between the user query and candidate laptop specifications. The model is prompted to output both an explanatory rationale and a normalized relevance score. This mimics the internal deliberation process characteristic of cognitive agents.
  • Multimodal decision-making and fusion strategy: The agent incorporates both image and text modalities through the use of CLIP embeddings. During reranking, it computes a combined relevance score by fusing the image similarity score (derived from vector distance) and the CoT-based text score as shown in the following formula:

Combined score = α · image score + (1 − α) · text score

This decision policy exemplifies multimodal reasoning, allowing the agent to autonomously determine the most relevant candidate through weighted evidence aggregation.
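As a worked example of this fusion policy, with α = 0.5 as in the text and illustrative (invented) per-candidate scores:

```python
def combined_score(image_score, text_score, alpha=0.5):
    """Weighted fusion of image similarity and CoT-based text relevance."""
    return alpha * image_score + (1 - alpha) * text_score

# Candidate A: strong image match but weak textual reasoning score.
# Candidate B: reasonably strong on both modalities.
candidates = {
    "A": combined_score(image_score=0.9, text_score=0.3),  # 0.60
    "B": combined_score(image_score=0.7, text_score=0.8),  # 0.75
}
best = max(candidates, key=candidates.get)
print(best)  # "B": balanced evidence beats a single strong modality
```

Raising α shifts the policy toward visual evidence; lowering it privileges the CoT-derived textual judgment.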

Agentic attributes and functionality

The system demonstrates key properties of an agentic architecture:

  • State management via a mutable dictionary passed across nodes.
  • Autonomous decision-making based on learned embeddings and LLM-generated evaluations.
  • Task decomposition into modular and reusable graph nodes.
  • Explainability through logged reasoning and selected scores.
  • Robustness via fallback logic when suitable candidates are not identified.

Although the current design follows a linear execution path without dynamic branching or tool use, the underlying architecture is extensible to support conditional transitions, tool invocation, and more complex agent behaviors. The agent is built upon a LangGraph-based execution flow, where each node performs a specialized function, ranging from embedding inputs to reranking candidates using LLM-based reasoning. The architecture enables interpretable, multimodal decision-making through modular state transitions and structured prompt engineering. The key components include the following:

  • LLM scoring with CoT: The llm_score() function prompts the model to evaluate how well a given spec matches the query. The prompt is designed to elicit step-by-step reasoning followed by a numeric score:

    def llm_score(query: str, specs_text: str) -> tuple[float, str]:
        prompt = (
            "Evaluate how well these laptop specs satisfy the user request.\n"
            "First think step-by-step, then output exactly two lines:\n"
            "Reasoning: <your analysis>\n"
            "Score: <single number between 0 and 1>\n\n"
            f"User request: {query}\n\nLaptop specs:\n{specs_text}"
        )
        ...

The LLM used here is ChatOllama with the "mistral" model.
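A helper is then needed to pull the rationale and the numeric score out of the model's two-line reply. The parser below is a hypothetical sketch of such a helper (the book's exact implementation may differ); it clamps the score to [0, 1] and falls back to a neutral 0.5 when the model deviates from the requested format:

```python
import re

def parse_llm_score(raw: str) -> tuple[float, str]:
    """Extract (score, reasoning) from the two-line reply requested by the prompt.

    Assumes the model followed:
        Reasoning: <analysis>
        Score: <number between 0 and 1>
    Falls back to a neutral 0.5 when no score is found.
    """
    reasoning_match = re.search(r"Reasoning:\s*(.+)", raw)
    score_match = re.search(r"Score:\s*([01](?:\.\d+)?)", raw)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    score = float(score_match.group(1)) if score_match else 0.5
    return max(0.0, min(1.0, score)), reasoning

reply = "Reasoning: The specs include 32 GB RAM and a dedicated GPU.\nScore: 0.85"
score, why = parse_llm_score(reply)
print(score)  # 0.85
```

Defensive parsing like this matters in practice because local models occasionally emit extra prose around the requested two-line format.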

  • LangGraph nodes: The LangGraph agent defines three core nodes, which are embed, rerank, and read, and each is responsible for a distinct step in the multimodal retrieval and decision pipeline, details as follows:
    • node_embed(state): Embeds the input query and optionally an image. It computes a vector average if both are present, and queries Chroma for top-5 nearest text specs:

      vec = [(a + b) / 2 for a, b in zip(text_vec, img_vec)] if img_vec else text_vec

      res = client.get_collection(CHROMA_TEXT_COLLECTION).query(query_embeddings=[vec], ...)

    • node_rerank_llm(state): Performs a multimodal scoring strategy that combines:

      image_score = 1 – distance

      text_score = llm_score(query, spec)

      combined_score = α·image_score + (1–α)·text_score

      This score is used to select the most appropriate spec:

      combined = alpha * img_score + (1 - alpha) * text_score

      The CoT reasoning is also logged for interpretability.

    • node_read(state): Reads the text of the best-ranked spec and its associated image path. If either is unavailable, it handles missing data gracefully.
  • Graph construction: Using StateGraph, the agent flow is defined with explicit transitions:

    builder.add_node("Embed", node_embed)
    builder.add_node("Rerank", node_rerank_llm)
    builder.add_node("Read", node_read)
    builder.set_entry_point("Embed")
    builder.add_edge("Embed", "Rerank")
    builder.add_edge("Rerank", "Read")
    builder.set_finish_point("Read")
    graph = builder.compile()

  • Agent API: The final callable method execute_graph_agent() initiates the graph with a query and an optional image vector:

    def execute_graph_agent(user_query: str, image_vec: list[float] | None = None) -> str:
        res = graph.invoke({"input": user_query, "image_vec": image_vec})
        ...

It returns a formatted output that includes the selected specs, image path, and reasoning log.

The codebase presents a well-modularized and extensible system for multimodal information retrieval and recommendation. Notably, langgraph_agent.py integrates LangGraph for orchestrating a pipeline with semantic retrieval and CoT reranking, thus allowing the system to produce explainable and robust matches between user queries and multimodal content. The separation of concerns across loaders, embedding modules, indexing, and reranking ensures reusability and maintainability across various domains beyond laptops, such as e-commerce, education, or healthcare.

The full code can be found in the GitHub repository of this book, under the section reasoning at the recommendation stage.

Having explored how reasoning is leveraged during the reranking phase through a CoT reasoning reranker in a multimodal GenAI system, we now shift our focus to a different but equally critical stage in the pipeline, recommendation.

In this next scenario, reasoning plays a pivotal role in synthesizing insights across multiple heterogeneous datasets to generate personalized and context-aware recommendations.

Architecture for reasoning at the recommendation stage

The following figure illustrates the complete flow of a personalized RAG pipeline that integrates structured catalogue data, user preference profiles, and metadata into a unified vector database. All datasets are chunked, embedded using a shared embedding model, and stored in a vector store. At query time, the system performs hybrid retrieval (combining Best Matching 25 (BM25) and dense vector search), followed by a cross-encoder-based reranker for fine-grained scoring.

Flowchart showing how catalogue data, user preferences, and metadata are processed by the embedding model; a query retrieves results from the vector database, followed by hybrid search, reranking, and output of the top-ranked results.

Figure 13.3: Architecture for reasoning at the recommendation stage
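The hybrid retrieval stage in this architecture can be approximated as a weighted blend of normalized sparse and dense scores. Note that LangChain's EnsembleRetriever actually combines rankings (weighted reciprocal rank fusion) rather than raw scores; the sketch below, with hypothetical weights and scores, only illustrates the underlying intuition of merging lexical and semantic evidence:

```python
def hybrid_scores(bm25, dense, w_sparse=0.4, w_dense=0.6):
    """Blend normalized sparse (BM25) and dense similarity scores per doc id."""
    ids = set(bm25) | set(dense)
    return {i: w_sparse * bm25.get(i, 0.0) + w_dense * dense.get(i, 0.0) for i in ids}

# Hypothetical normalized scores: doc2 is only a lexical match, doc3 only semantic.
bm25 = {"doc1": 0.8, "doc2": 0.9}
dense = {"doc1": 0.7, "doc3": 0.95}
fused = hybrid_scores(bm25, dense)
top = max(fused, key=fused.get)
print(top)  # "doc1" scores well on both retrievers
```

The cross-encoder reranker then rescoring only the fused top-k keeps its expensive pairwise scoring off the long tail of candidates.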

The dataset

Three datasets are used in this code; they are shared along with the code for Chapter 12, Advanced Multimodal GenAI Systems, under the CoT, reasoning, and reranking recommendation engine with LLM, code part 2.

The following list outlines the three different datasets:

  • Updated_Synthetic_Dataset__500_Rows_.csv:
    • Purpose: This is the main content database containing 500 rows of synthetic content items.
    • Use case:
      • Used to build the vector store (Chroma).
      • Embedded using OllamaEmbeddings.
      • Queried and retrieved based on user prompts (e.g., nostalgic mood).
    • Contents include: title, genre, sub_genre, theme, live_event_flag, content_category, age_rating, etc.
  • User_Preference_Profiles.csv:
    • Purpose: This dataset defines user preferences, i.e., likes and dislikes.
    • Use case:
      • Matches incoming queries to users with similar tastes (e.g., user 16, 52, 19).
      • Helps guide filtering logic (e.g., family content, coming-of-age themes).
  • synthetic_dataset_metadata.csv:
    • Purpose: Describes the meaning of each column in the main dataset (metadata dictionary).
    • Use case:
      • Acts as a documentation layer for human and LLM interpretability.
      • Can be optionally included in system prompts or for field validation.

The following directory structure outlines the architecture of the rag_llm_memory_project, a modular RAG system enhanced with long-term memory, personalized profiling, and multimodal reasoning capabilities. Each folder encapsulates a key functional layer of the pipeline, from embedding and retrieval to orchestration, reasoning prompts, and reranking, enabling scalable, context-aware content recommendations and diverse data modalities.

Screenshot of the file directory tree of rag_llm_memory_project, showing folders and Python scripts related to the RAG system, embeddings, vector store, LLM, profiling, datasets, ranking, and the requirements file.

Figure 13.4: Folder structure for reasoning at the recommendation stage

Goal of the recommendation engine

The goal of this recommendation engine is to deliver contextually appropriate content by interpreting natural language prompts through a CoT reasoning process. The following illustrates a representative execution scenario based on the user prompt:

User input prompt:

I’m looking for content that matches my mood, which is currently nostalgic, and I want to watch with my 17-year-old daughter.

The following list outlines the system's reasoning and execution steps:

1. Mood identification: The system interpreted the emotional tone of the query and categorized the user's current mood as nostalgic.

2. Audience analysis: The assistant recognized that the content must be suitable for both the user and a 17-year-old viewer, thus enforcing family-friendly constraints.

3. Genre mapping: The nostalgic mood was algorithmically associated with the coming-of-age sub-genre, which typically aligns with reflective and emotional themes.

4. Demographic compatibility: The assistant prioritized content with cross-generational appeal, targeting narratives resonant with both teenage and adult audiences.

5. User preference profiling: The engine referenced a preference database, examining profiles of users aged 16, 52, 19, 77, and 82 to filter for those favoring family and coming-of-age content.

6. Intersection analysis: A focused subset of users (IDs: 16, 52, 19) was identified whose preferences aligned with both genre and audience criteria.

7. Thematic enrichment: Common thematic preferences across the filtered user group were extracted, highlighting courage, love, and adventure as dominant narrative elements.

8. Content-type filtering: The system excluded live event content by applying a logical constraint (live_event_flag = False), in accordance with inferred viewing context.

Final retrieval constraints

The system formalized the query as the following structured retrieval specification:

  • content_category = "Family"
  • sub_genre = "Coming-of-Age"
  • themes ∈ {Courage, Love, Adventure}
  • live_event_flag = False

The following is the final output statement:

Retrieve non-live Family content with a focus on the Coming-of-Age sub-genre, incorporating themes of Courage, Love, and Adventure.
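This structured specification can be applied as a straightforward boolean filter over the catalogue rows. The sketch below uses invented toy rows mirroring the relevant columns of Updated_Synthetic_Dataset__500_Rows_.csv (the titles are hypothetical):

```python
# Invented toy rows mirroring columns of the main content dataset.
catalogue = [
    {"title": "Summer of '89", "content_category": "Family",
     "sub_genre": "Coming-of-Age", "theme": "Courage", "live_event_flag": False},
    {"title": "Arena Live Finals", "content_category": "Family",
     "sub_genre": "Coming-of-Age", "theme": "Love", "live_event_flag": True},
    {"title": "Midnight Heist", "content_category": "Action",
     "sub_genre": "Thriller", "theme": "Adventure", "live_event_flag": False},
]

def apply_constraints(rows):
    """Apply the structured retrieval specification as boolean filters."""
    wanted_themes = {"Courage", "Love", "Adventure"}
    return [
        r for r in rows
        if r["content_category"] == "Family"
        and r["sub_genre"] == "Coming-of-Age"
        and r["theme"] in wanted_themes
        and not r["live_event_flag"]
    ]

results = apply_constraints(catalogue)
print([r["title"] for r in results])  # only the non-live Family coming-of-age title
```

In the full pipeline, this hard filter complements the soft semantic ranking produced by hybrid retrieval and reranking.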

Modular codebase breakdown

This system is a modular RAG pipeline designed to deliver personalized content recommendations by integrating structured profile data, hybrid retrieval mechanisms (BM25 + dense vectors), reranking via cross-encoder models, and LLM reasoning. The pipeline is built using LangChain, ChromaDB, Ollama, Transformers, and PyTorch, enabling dynamic retrieval, reasoning, and user-specific memory-based generation.

The following list outlines the modular components forming the backbone of the RAG assistant, each responsible for a specific stage in the retrieval and generation workflow, from data loading and vector indexing to hybrid retrieval, reranking, reasoning, and answer generation:

  • app/main.py:
    • Purpose: Entry point for the RAG assistant interaction loop.
    • Key responsibilities:
      • Accepts user queries from the command line.
      • Calls the rag_chain via invoke() and prints both the answer and source documents.
  • app/config.py:
    • Purpose: Stores global configuration constants.
    • Includes:
      • Model names (LLM and embedding models).
      • Paths to vector databases and data sources.
  • embeddings/embedder.py:
    • Purpose: Initializes the embedding model.
    • Implements:
      • OllamaEmbeddings for converting text chunks into dense vectors using a local embedding model like nomic-embed-text.
  • vectorstore/db_handler.py:
    • Purpose: Creates and persists the vector store using ChromaDB.
    • Details:
      • Loads documents and embeds them.
      • Stores them in a persistent directory (VECTOR_DB_PATH).
  • vectorstore/metadata_schema.py:
    • Purpose: Attaches metadata to each document chunk before vector indexing.
    • Metadata example:
      • "source" field from the file name to trace results to their origin.
  • retriever/hybrid_search.py:
    • Purpose: Implements hybrid retrieval using BM25 + vector similarity.
    • Details:
      • Uses EnsembleRetriever to combine sparse (BM25) and dense (Chroma) scores.
      • Helps recall both lexical and semantic matches effectively.
  • reranker/cross_encoder.py:
    • Purpose: Reranks retrieved passages using a pretrained cross-encoder model.
    • Model used: cross-encoder/ms-marco-MiniLM-L-6-v2
    • Workflow:
      • For each query-document pair, it computes a relevance score using BERT-style scoring.
      • Returns the top-k reranked passages used by the generator.
  • llm/generate.py:
    • Purpose: Initializes the LLM used for final answer generation.
    • Model used: Ollama
    • Features:
      • Local inference, low-latency, configurable temperature, and decoding settings.
  • llm/react_prompt.py:
    • Purpose: Provides a ReAct-style prompt for reasoning over the retrieved documents.
    • Format:
      • First, ask the LLM to list reasoning steps.
      • Then, to produce an answer based on reasoning and context.
  • llm/system_prompt.py:
    • Purpose: Defines the system prompt that acts as the instruction guide for the assistant.
    • Scenario:
      • Interprets structured user profiles and adapts recommendations to context (e.g., mood, companion, age appropriateness).
  • orchestrator/rag_chain.py:
    • Purpose: Integrates all modules into a complete LangChain RAG pipeline.
    • Pipeline:
      1. Load documents and chunk them.
      2. Embed and store them.
      3. Use hybrid retrieval to fetch relevant chunks.
      4. Apply the cross-encoder reranker.
      5. Run the final generation with a ReAct prompt.
      6. Store conversation memory.
      7. Return the answer and sources.
  • memory/conversation_buffer.py:
    • Purpose: Maintains conversational history across turns.
    • Used for:
      • Enabling contextual responses in multi-turn dialogue.
      • Only the assistant’s generated response is saved, not the retrieval sources.
  • utils/data_loader.py:
    • Purpose: Loads and parses CSV datasets.
    • Datasets supported:
      • Updated_Synthetic_Dataset__500_Rows_
      • User_Preference_Profiles
      • synthetic_dataset_metadata
    • Function: Splits rows into chunks and returns LangChain Document objects for vectorization.
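The wiring of these modules can be illustrated without the heavy dependencies. The following is a minimal, dependency-free sketch of the hybrid retrieval and reranking stages only; the scoring functions are crude stand-ins for BM25, dense embeddings, and the cross-encoder, and all document text and weights are invented for illustration:

```python
from collections import Counter
import math

DOCS = [
    "The Goonies is a 1985 adventure film about kids on a treasure hunt.",
    "Back to the Future is a 1985 sci-fi comedy about time travel.",
    "E.T. the Extra-Terrestrial is a 1982 family film about friendship.",
]

def sparse_score(query, doc):
    # Stand-in for BM25: overlapping term counts, dampened by document length.
    q, d = query.lower().split(), doc.lower().split()
    overlap = sum((Counter(q) & Counter(d)).values())
    return overlap / math.sqrt(len(d))

def dense_score(query, doc):
    # Stand-in for embedding cosine similarity: Jaccard over character trigrams.
    def grams(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}
    a, b = grams(query.lower()), grams(doc.lower())
    return len(a & b) / len(a | b)

def hybrid_retrieve(query, docs, k=2, w_sparse=0.5, w_dense=0.5):
    # Mirrors the EnsembleRetriever idea: weighted fusion of sparse and dense scores.
    scored = [(w_sparse * sparse_score(query, d) + w_dense * dense_score(query, d), d)
              for d in docs]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

def rerank(query, candidates):
    # Stand-in for the cross-encoder: rescore each (query, passage) pair jointly.
    return sorted(candidates, key=lambda d: dense_score(query, d), reverse=True)

top = rerank("treasure hunt adventure", hybrid_retrieve("treasure hunt adventure", DOCS))
print(top[0])
```

In the actual pipeline, `sparse_score` and `dense_score` would be replaced by BM25 and `nomic-embed-text` vectors in ChromaDB, and `rerank` by `cross-encoder/ms-marco-MiniLM-L-6-v2` scoring each pair.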

The output is a structured reasoning trace from a conversational recommender system that tailors film suggestions based on user preferences. Two users are involved; details as follows:

  • User_82 wants 90s coming-of-age and family dramas suitable for a 17-year-old, avoiding genres like Sci-Fi and Comedy, and dislikes themes like friendship and family bonds.
  • User_19 requests light-hearted 80s films (romantic comedies or slice-of-life) suitable for a 13-year-old and their partner.

The following figure depicts the assistant filters, which retrieve and present recommendations accordingly. It suggests classics like The Goonies, E.T., and Back to the Future as they align with the nostalgic, family-friendly, and age-appropriate preferences.

A screenshot of a dark-themed interface shows two user prompts, about light-hearted teen romantic comedies or slice-of-life films from the 80s and 90s respectively, with the corresponding detailed assistant responses and final recommended answers.

Figure 13.5: Output from the GenAI recommendation engine for two users

This system presents an advanced RAG pipeline that seamlessly integrates profile-based and content-based recommendation strategies, enhancing personalization through the use of hybrid retrieval and cross-encoder reranking for improved relevance. By incorporating conversational memory and ReAct-style prompting, it enables intelligent, context-aware responses tailored to user preferences. Additionally, the architecture is designed for extensibility, allowing future integration with multimodal inputs or real-time streaming data sources.

Conclusion

This chapter has outlined the foundational principles and practical implementations of reasoning within GenAI systems. By examining prompting techniques, we highlighted how structured reasoning can be elicited from language models to enhance decision-making. The discussion on reranking architectures demonstrated how reasoning can improve the selection of relevant outputs, while the exploration of recommendation stage reasoning illustrated the integration of diverse data sources to personalize content effectively. Together, these components form a cohesive framework for developing intelligent systems that go beyond surface-level retrieval, enabling context-aware, user-aligned responses. This understanding sets the stage for designing more robust and interpretable AI applications.

In the next chapter, we will understand other topics like text-to-SQL.

CHAPTER 14Building Text-to-SQL Systems

Introduction

In the age of data-driven decision-making, the ability to interact with databases using natural language has emerged as a transformative capability. Text-to-Structured Query Language (SQL), a branch of natural language processing (NLP), enables users to translate plain English queries into structured SQL commands, allowing even non-technical users to access complex data insights with ease. This chapter explores the fundamental principles, system design, real-world applications, and practical implementation of text-to-SQL systems, particularly those powered by large language models (LLMs).

We begin by introducing the basic concepts underpinning text-to-SQL, including natural language understanding, schema linking, and SQL query generation. We will then examine the architectural foundations of modern text-to-SQL systems, highlighting the role of schema-aware prompting, LLMs, and tool-based orchestration. The chapter also discusses various applications, from business intelligence (BI) dashboards to voice-enabled analytics assistants.

Despite its promise, text-to-SQL poses unique challenges, including handling ambiguous queries, ensuring SQL validity, and aligning natural language with complex database schemas. We conclude with a practical implementation guide, outlining strategies for prompt design, schema integration, validation techniques, and evaluation metrics.

By the end of this chapter, readers will gain a comprehensive understanding of how natural language interfaces can revolutionize database accessibility and empower broader data literacy across organizations.

Structure

This chapter covers the following topics:

  • Text-to-SQL a hard problem
  • Understanding basic concepts
  • Exploration of real-world applications
  • Key challenges
  • Practical guidance on designing a text-to-SQL system
  • Entity extraction using LLM and text-to-SQL system
  • Enhance data accessibility and literacy
  • Performance metrics and best practices

Objectives

The primary objective of this chapter is to equip readers with a foundational understanding of text-to-SQL systems, enabling them to grasp how natural language inputs can be transformed into executable SQL queries using modern LLM-based techniques. By exploring core concepts, system architecture, practical applications, implementation strategies, and evaluation metrics, this chapter aims to provide both conceptual clarity and practical guidance. Readers will gain the necessary knowledge to design, evaluate, or extend text-to-SQL pipelines within their own domains, preparing them for more advanced, agent-based systems discussed in the next chapter. This sets the stage for intelligent, interactive data access workflows.

Text-to-SQL a hard problem

Despite the rapid advancements in generative AI (GenAI), particularly in NLP and code generation, the task of translating natural language into SQL, commonly referred to as text-to-SQL, remains one of the most challenging and nuanced problems in the field. While LLMs such as Generative Pre-trained Transformer (GPT) have significantly improved the fluency and contextual understanding of machines, they still struggle with the precise, structured, and domain-specific nature of SQL generation. This difficulty is compounded by a range of practical and theoretical challenges that make widespread deployment of text-to-SQL systems non-trivial, particularly in enterprise settings.

One of the fundamental challenges lies in the misalignment between natural language and structured data schemas. Human language is inherently ambiguous, context-rich, and often incomplete, whereas SQL requires exact, deterministic specifications that match the schema of a particular database. Users may refer to columns or tables in ways that do not directly align with the schema, using synonyms, abbreviations, or business-specific terminology, which requires the model not only to understand the intent but also to map it accurately to the database structure. This issue, known as schema linking, remains one of the core bottlenecks in building robust text-to-SQL systems.

Furthermore, not every organization’s data is ready for GenAI-based querying. Most enterprise databases are designed for performance and legacy compatibility, not for semantic accessibility. They may lack proper documentation, use inconsistent naming conventions, or contain deeply nested schemas that are hard to interpret even for experienced engineers. Without clean, well-structured, and richly annotated metadata, even the most powerful LLMs struggle to produce valid and contextually accurate SQL queries. This lack of GenAI readiness in corporate data environments severely limits the practical applicability of text-to-SQL systems in many organizations.

Another challenge is the lack of generalizability across domains. While LLMs fine-tuned on benchmark datasets like Spider or WikiSQL perform reasonably well in academic settings, their effectiveness drops significantly when applied to real-world databases that differ in schema design, data quality, or business logic. Domain-specific nuances often require customization of prompts, fine-tuning on proprietary data, and the inclusion of domain knowledge, which increases development complexity and reduces scalability.

Additionally, ensuring the correctness and safety of the generated SQL poses a significant risk. Incorrect or malformed SQL queries can lead to performance degradation, privacy violations, or even data corruption if write operations are involved. Validating the output of LLMs requires execution-time checks, permission constraints, and ideally a human-in-the-loop (HITL) system, all of which introduce latency and operational overhead.
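These execution-time checks can be approximated with a lightweight guardrail. The sketch below, using SQLite purely for illustration, enforces a read-only whitelist and dry-compiles the statement with `EXPLAIN` so that syntax errors and references to missing tables are caught before anything executes; the schema is hypothetical:

```python
import sqlite3

def validate_sql(sql: str, conn: sqlite3.Connection) -> tuple[bool, str]:
    """Reject write operations, then dry-run the query plan before execution."""
    stripped = sql.strip().rstrip(";").strip()
    # 1. Permission constraint: allow read-only statements only.
    if not stripped.lower().startswith(("select", "with")):
        return False, "only SELECT queries are permitted"
    # 2. Execution-time check: EXPLAIN compiles the query without running it,
    #    surfacing syntax errors and missing tables/columns.
    try:
        conn.execute(f"EXPLAIN {stripped}")
    except sqlite3.Error as exc:
        return False, str(exc)
    return True, "ok"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_name TEXT, units_sold INT, sale_date TEXT)")

print(validate_sql("SELECT product_name FROM sales WHERE units_sold > 1000", conn))
print(validate_sql("DROP TABLE sales", conn))
print(validate_sql("SELECT nope FROM missing", conn))
```

A production system would add permission scoping at the database role level and, for borderline cases, route the query to a HITL review step rather than executing it directly.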

In summary, while GenAI has brought unprecedented capabilities to natural language understanding and generation, the structured, context-specific, and high-stakes nature of SQL generation makes text-to-SQL an enduringly difficult problem. The challenges of schema alignment, data readiness, domain generalization, and execution safety must all be carefully addressed before text-to-SQL can achieve widespread, reliable adoption in enterprise environments.

Understanding basic concepts

Text-to-SQL refers to the task of translating natural language queries into SQL statements that can be executed on relational databases. The goal is to enable non-technical users to interact with databases without needing expertise in SQL syntax or a deep understanding of the underlying data schema. This transformation involves several key components, including natural language understanding, schema linking, semantic parsing, and SQL query generation.

Natural language understanding is the initial phase where the system interprets the user's intent conveyed through human language. For instance, if a user asks, what are the total sales for each region in 2023? The system must identify entities such as sales, region, and the temporal constraint 2023. This requires both syntactic analysis (e.g., parts-of-speech tagging, dependency parsing) and semantic interpretation (e.g., recognizing that total implies an aggregation function).
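As a toy illustration of this interpretation step (production systems use learned models rather than keyword tables; the aggregation map and stop-word list below are invented):

```python
import re

# Hypothetical keyword-to-aggregation table for illustration only.
AGGREGATIONS = {"total": "SUM", "average": "AVG", "count": "COUNT", "highest": "MAX"}

def interpret(question: str) -> dict:
    """Extract a coarse intent: aggregation hint, candidate entities, year filter."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    agg = next((AGGREGATIONS[t] for t in tokens if t in AGGREGATIONS), None)
    year = next((t for t in tokens if re.fullmatch(r"(19|20)\d{2}", t)), None)
    stopwords = {"what", "are", "the", "for", "each", "in", "is"}
    entities = [t for t in tokens
                if t not in AGGREGATIONS and not t.isdigit() and t not in stopwords]
    return {"aggregation": agg, "entities": entities, "year": year}

print(interpret("What are the total sales for each region in 2023?"))
# → {'aggregation': 'SUM', 'entities': ['sales', 'region'], 'year': '2023'}
```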

Schema linking is a fundamental aspect that involves aligning the natural language elements with the database schema components. In practice, this requires mapping phrases like total sales to a specific column in a sales table, and region to either a column in the same table or in a related regions table. Effective schema linking often involves synonym resolution, entity recognition, and disambiguation, which are non-trivial in heterogeneous or poorly documented databases. Schema linking can be categorized into explicit (direct matches between query and schema terms), implicit (requiring inference based on context), and fuzzy (handling vague or ambiguous references).
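A minimal linker makes the three categories concrete. The schema, the synonym table, and the use of `difflib` as a stand-in for learned similarity are all illustrative assumptions:

```python
import difflib

# Hypothetical schema and synonym table for illustration.
SCHEMA = {"sales": ["product_name", "units_sold", "revenue", "region", "sale_date"]}
SYNONYMS = {"earnings": "revenue", "income": "revenue", "area": "region"}

def link(phrase: str):
    """Resolve a user phrase to a schema column: explicit, implicit, or fuzzy."""
    term = phrase.lower().strip()
    columns = [c for cols in SCHEMA.values() for c in cols]
    if term in columns:                       # explicit: direct match to schema term
        return term, "explicit"
    if term in SYNONYMS:                      # implicit: resolved via synonym inference
        return SYNONYMS[term], "implicit"
    close = difflib.get_close_matches(term, columns, n=1, cutoff=0.6)
    if close:                                 # fuzzy: approximate/ambiguous reference
        return close[0], "fuzzy"
    return None, "unlinked"

print(link("revenue"))   # explicit
print(link("earnings"))  # implicit
print(link("regions"))   # fuzzy
```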

Semantic parsing is the process of converting the interpreted natural language into a structured logical form, such as an abstract syntax tree or logical query plan. This representation captures the semantics of the user’s request in a format that can be translated into SQL. Different parsing techniques include rule-based systems, statistical models, and neural approaches such as encoder-decoder architectures with attention mechanisms.

SQL generation involves mapping the logical form into an executable SQL query. This includes determining the appropriate SQL clauses (SELECT, FROM, WHERE, GROUP BY, etc.), resolving joins between related tables, applying aggregation functions, and ensuring correct filtering conditions. For example, the natural language question, which products sold more than 1,000 units in January 2023? would be translated into:

SELECT product_name FROM sales WHERE units_sold > 1000 AND sale_date BETWEEN '2023-01-01' AND '2023-01-31';

This transformation shows the need for precise mapping between human intent and machine-readable syntax.
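To make that mapping concrete, a tiny generator can render an intermediate logical form into the query above; the dictionary layout is an invented illustration, not a standard representation:

```python
def to_sql(form: dict) -> str:
    """Render a minimal logical form into a SQL string."""
    where = " AND ".join(form.get("filters", []))
    sql = f"SELECT {', '.join(form['select'])} FROM {form['table']}"
    if where:
        sql += f" WHERE {where}"
    return sql + ";"

# Logical form produced by the (hypothetical) semantic parsing step.
form = {
    "select": ["product_name"],
    "table": "sales",
    "filters": [
        "units_sold > 1000",
        "sale_date BETWEEN '2023-01-01' AND '2023-01-31'",
    ],
}
print(to_sql(form))
# → SELECT product_name FROM sales WHERE units_sold > 1000 AND sale_date BETWEEN '2023-01-01' AND '2023-01-31';
```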

Historically, text-to-SQL systems started as rule-based or template-driven methods that relied on handcrafted grammars and limited vocabularies. These systems lacked scalability and adaptability across domains. The introduction of machine learning (ML), especially deep learning, marked a shift toward more flexible, data-driven approaches. The use of sequence-to-sequence (Seq2Seq) models, attention mechanisms, and, more recently, LLMs such as GPT and Codex, has significantly advanced the state of the art.

Different types of SQL queries must also be considered. Simple queries involve SELECT or WHERE clauses, but more complex queries involve joins, aggregations, nested subqueries, window functions, and set operations like UNION or INTERSECT. Understanding these types is essential to cover a wide range of user intents.

Text-to-SQL systems can be categorized based on the level of supervision in their training: fully supervised systems require paired natural language and SQL examples; weakly supervised systems rely on indirect supervision (e.g., execution results); and unsupervised systems attempt to learn mappings without explicit training examples. Another useful classification is based on interaction style: single-shot queries vs. multi-turn dialogue systems that support follow-up questions and clarification.

In terms of schema representation, systems must handle various complexities, including flat schemas (single table), hierarchical schemas (parent-child relationships), and relational graphs (multi-table databases with foreign keys). Representing the schema in a way that LLMs can understand, such as serialized table schemas, table-entity graphs, or embeddings, is crucial for accurate query generation.
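One common serialization is a plain-text rendering of tables, column types, and foreign keys that can be pasted directly into an LLM prompt. A minimal sketch, with a hypothetical two-table schema:

```python
def serialize_schema(schema: dict) -> str:
    """Flatten a relational schema into a prompt-friendly text block."""
    lines = []
    for table, spec in schema.items():
        cols = ", ".join(f"{name} {dtype}" for name, dtype in spec["columns"])
        lines.append(f"TABLE {table} ({cols})")
        for fk, target in spec.get("foreign_keys", {}).items():
            lines.append(f"  FOREIGN KEY {table}.{fk} -> {target}")
    return "\n".join(lines)

# Hypothetical schema for illustration.
schema = {
    "regions": {"columns": [("id", "INT"), ("name", "TEXT")]},
    "sales": {
        "columns": [("product_name", "TEXT"), ("units_sold", "INT"), ("region_id", "INT")],
        "foreign_keys": {"region_id": "regions.id"},
    },
}
print(serialize_schema(schema))
```

The resulting block would typically be prepended to the user question in the prompt so the model can ground column references against real schema terms.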

In modern GenAI contexts, large pre-trained models have proven effective in understanding and generating SQL queries. However, they still depend heavily on prompt quality and schema-awareness. Techniques such as prompt engineering, retrieval-augmented generation (RAG), and tool-based augmentation [e.g., function-calling application programming interfaces (APIs)] are commonly used to improve accuracy and generalizability.

Use cases for text-to-SQL span across domains. In finance, users may query transaction volumes or average revenue. In healthcare, physicians might ask for patient data filtered by conditions or timeframes. In education, students can learn SQL by comparing natural language and formal query pairs. In public data access, citizens can ask natural language questions to extract insights from open government databases.

Despite advancements, common errors in text-to-SQL systems include semantic drift (where the generated SQL does not match the original intent), incorrect table or column references, and misinterpretation of filters or constraints. Mitigating these issues requires robust schema linking, strong language understanding, and dynamic validation mechanisms.

Understanding the basic concepts of text-to-SQL involves dissecting the multi-step process of parsing, linking, and translating human language into SQL. As GenAI models evolve, these systems are poised to become more accessible, adaptable, and accurate, but the foundational principles remain critical for successful implementation.

Exploration of real-world applications

The practical value of text-to-SQL systems extends across a wide spectrum of industries, enabling more intuitive and efficient access to data through natural language interfaces. As organizations increasingly adopt data-driven decision-making processes, the need for non-technical stakeholders to interact directly with structured databases becomes critical. Text-to-SQL provides a mechanism for bridging this gap, fostering inclusivity, and democratizing access to insights. The following are the key domains where text-to-SQL is making a significant impact:

  • BI and analytics: One of the most common applications of text-to-SQL is within BI platforms and analytical dashboards. BI tools like Power BI, Tableau, and Looker are often configured with static SQL queries or filters, requiring technical expertise to modify. By integrating a text-to-SQL engine, business analysts, product managers, or sales executives can query data repositories using natural language. For instance, a user could ask, what were the top five products by revenue in Q2 2023? And receive a visualized table or chart powered by a dynamically generated SQL query. This capability reduces the burden on IT teams and accelerates insight discovery.
  • Conversational interfaces and virtual assistants: Text-to-SQL serves as a core component in conversational data agents—virtual assistants that allow users to pose questions about data in plain language. These systems are being embedded into enterprise chat platforms (like Slack, Teams, or custom internal dashboards), where they can respond to real-time data queries. A marketing manager might ask, how many users signed up through the referral program last week? And receive an immediate response, backed by a live SQL query executed on the backend.
  • Customer support and operations analytics: Support teams benefit from natural language interfaces that allow them to monitor performance metrics and customer feedback. Text-to-SQL systems can enable support managers to ask, show me the average ticket resolution time by agent in the past month, or list all unresolved high-priority issues. These systems eliminate delays that typically arise from waiting for technical staff to write or run SQL scripts.
  • Healthcare and clinical informatics: In clinical settings, text-to-SQL can assist healthcare providers in retrieving relevant patient data, medical history, or aggregated outcomes. For example, a clinician might query, list diabetic patients over 60 who had elevated blood glucose in the last 3 months. In many electronic health record (EHR) systems, the underlying data is complex and requires knowledge of both schema and medical terminology. Text-to-SQL bridges this divide, improving accessibility while maintaining data compliance when paired with robust access control mechanisms.
  • Education and SQL learning environments: Educational tools that incorporate text-to-SQL offer students interactive ways to learn about databases and query formulation. Platforms designed for teaching data science or computer science often allow learners to enter a question in English and observe how it maps to an SQL query. Such platforms scaffold learning by connecting intuitive thinking to formal logic, and can also serve as debugging or reverse-engineering tools.
  • Open government and civic technology: Public data portals maintained by government agencies often expose datasets through SQL-backed APIs. However, public users frequently lack the expertise to write queries. Integrating text-to-SQL interfaces into civic platforms can allow citizens to pose questions like, which districts received the most education funding in 2022? And access curated data without barriers. This enhances transparency, civic participation, and policy analysis.
  • Retail and e-commerce personalization: In the retail sector, category managers, inventory planners, and marketing teams often require fast access to operational metrics. With text-to-SQL, they can ask, which SKUs had stockouts in more than three regions last month? Or, what was the average basket size for online purchases in Diwali week? These insights drive campaign design, product placement, and supply chain responsiveness.
  • Financial services and risk monitoring: Financial analysts and auditors frequently rely on historical data for forecasting, compliance, and fraud detection. Text-to-SQL can assist in querying structured financial systems using prompts such as, list all transactions over $10,000 flagged as suspicious between January and March 2023. This lowers entry barriers for auditors or compliance officers who may not be SQL experts but need direct access to timely data.
  • Manufacturing and IoT operations: With the rise of industrial IoT (IIoT), manufacturing systems generate vast amounts of structured telemetry data. Engineers and operations managers may use text-to-SQL to investigate performance anomalies or efficiency metrics. For example, show me all machine failures logged in Plant A with downtime greater than 2 hours last quarter. This promotes proactive maintenance and reduces downtime.
  • Human resources and organizational planning: Human resource (HR) professionals can benefit from text-to-SQL by querying employee data, training history, and performance metrics. Queries like, how many new hires completed onboarding in Q1? or, what is the average tenure of employees in the sales department? help with workforce analytics and planning. These systems also support diversity and inclusion reporting when integrated responsibly.

The real-world applications of text-to-SQL are vast and growing. From empowering internal stakeholders to improving public access to data, these systems are at the forefront of a shift toward more inclusive and intelligent data ecosystems. Their impact is particularly significant in organizations with large, heterogeneous datasets, where the friction of manual SQL scripting hinders decision-making. By embedding natural language interfaces into analytics workflows, organizations can unlock broader usage, deeper insights, and a faster path from question to answer.

Key challenges

While text-to-SQL systems offer a powerful interface between natural language and structured databases, the task remains fraught with substantial technical and practical challenges. These challenges arise from both the inherent ambiguity of natural language and the rigidity of SQL. Understanding these challenges is essential to designing robust, scalable, and enterprise-ready text-to-SQL systems.

The following list explores the most significant obstacles faced in this domain:

  • Ambiguity in natural language: Natural language is inherently ambiguous and context-dependent. Unlike SQL, which requires precise syntax and semantics, human language often relies on implied meanings, contextual cues, and incomplete expressions. This gap makes accurate translation difficult.

    For instance, consider the question, show me the top-performing regions last quarter. The term top-performing could refer to revenue, profit margin, customer satisfaction, or some other metric. Similarly, last quarter must be resolved relative to the current date, requiring temporal context. In the absence of explicit clarification, even advanced LLMs may struggle to generate accurate SQL queries.

    Another layer of complexity arises from pronouns and ellipsis in multi-turn dialogues. In a conversation where a user first asks, list all products sold in Europe, and then follows up with, which of them had declining sales?, the model must maintain contextual memory and resolve them to the correct entity set, a task that goes beyond syntactic translation and enters the realm of dialogue modeling and co-reference resolution.

  • Schema alignment and schema linking: A fundamental requirement in text-to-SQL systems is schema alignment, mapping user language to the specific schema elements of the underlying database. This includes identifying which table and column names correspond to entities and attributes mentioned in the query. The problem is particularly challenging when the schema is large, uses non-intuitive names, or is sparsely documented.

    Schema linking involves resolving expressions like the highest earning employee to something like employee.salary in the database. The complexity increases with the following:

    • Synonyms (e.g., income vs. revenue)
    • Abbreviations (e.g., dept vs. department)
    • Hidden relationships (e.g., join paths between tables not immediately obvious)
    • Multilingual expressions in user queries

    This necessitates deep semantic understanding and often requires embedding the schema context into the prompt or model input in a way that supports accurate grounding.
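
Production systems typically perform this grounding with embedding similarity over schema metadata; as a minimal stdlib sketch of the same idea, a hand-written synonym table plus fuzzy matching can resolve the cases listed above (the synonym table and column names are illustrative, not from any real system):

```python
import difflib

# Hypothetical synonym table; in practice these mappings would come from
# embeddings or a curated business glossary rather than a hand-written dict.
SYNONYMS = {"income": "revenue", "dept": "department", "staff": "employee"}

SCHEMA_COLUMNS = ["revenue", "department", "employee", "sales_amount", "order_date"]

def link_term(term):
    """Resolve a user-query term to a schema column: synonyms first, then fuzzy match."""
    term = SYNONYMS.get(term.lower(), term.lower())
    matches = difflib.get_close_matches(term, SCHEMA_COLUMNS, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(link_term("income"))   # revenue
print(link_term("employe"))  # employee (fuzzy match absorbs the typo)
```

A real implementation would swap the dictionary lookup for nearest-neighbor search over column-name embeddings, but the two-stage structure (normalize, then match) stays the same.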

  • Lack of domain generalization: Text-to-SQL models trained on benchmark datasets often perform well within the scope of those datasets but struggle when applied to domain-specific enterprise schemas. This issue, referred to as domain generalization, becomes more pronounced when the model is exposed to the following:
    • Unseen table names and column structures.
    • Industry-specific terminology (e.g., claims ratio in insurance).
    • Highly normalized relational databases.

      Even LLMs such as GPT-4 can falter without schema conditioning or fine-tuning on domain-relevant queries. This limits the out-of-the-box utility of text-to-SQL solutions and necessitates domain adaptation techniques such as RAG, schema pre-embedding, and prompt engineering with domain-specific examples.

  • SQL syntax and logical validity: Generating syntactically valid SQL is a non-trivial task, particularly when dealing with complex query structures involving multiple joins, nested subqueries, aggregations, and window functions. LLMs, while capable of generating plausible-looking SQL, often produce queries that:
    • Are syntactically incorrect.
    • Reference non-existent columns or tables.
    • Include contradictory conditions in WHERE or JOIN clauses.

    Beyond syntax, logical validity is another challenge. A query might run without error but return incorrect or misleading results. For example, an incorrectly placed GROUP BY clause or a missing HAVING filter can change the semantics of the query, resulting in analytics errors that may go unnoticed.
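
The logical-validity point can be made concrete with a small SQLite session: both queries below execute without error, but only the first answers the intended question, customers whose total spend exceeds 100 (the table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount REAL);
INSERT INTO orders VALUES ('a', 50), ('a', 70), ('b', 10);
""")

# Intended semantics: filter on the aggregate with HAVING.
with_having = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer HAVING SUM(amount) > 100"
).fetchall()

# Subtly wrong: filtering rows before aggregation answers a different question.
wrong = conn.execute(
    "SELECT customer, SUM(amount) FROM orders WHERE amount > 100 GROUP BY customer"
).fetchall()

print(with_having)  # [('a', 120.0)]
print(wrong)        # [] -- runs cleanly, silently returns the wrong answer
```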

  • Query execution constraints: Even when syntactically and semantically valid, executing the generated SQL poses risks and constraints in production environments. Key challenges include the following:
    • Performance issues: Poorly optimized queries may strain database resources.
    • Security: Risk of SQL injection or unauthorized access to sensitive data.
    • Data freshness: Models unaware of recent schema or data changes may produce outdated or irrelevant queries.

    Additionally, query execution requires live database access, which complicates the training, debugging, and deployment of these systems. Offline validation environments or test sandboxes are often required, but they do not always replicate the production schema or data volume accurately.

  • Multi-turn interaction and dialogue context: In real-world applications, users often engage in multi-turn conversations with data agents. This introduces the challenge of maintaining context across interactions, interpreting follow-up questions, and refining results iteratively.

    Consider a dialogue like:

    • Show me sales for Q1.
    • Now break that down by region.
    • Exclude products with returns over 10%.

    Each utterance depends on the context established by the previous ones. Maintaining the evolving query structure, filtering criteria, and target table references across turns is a significant architectural and modeling challenge. It requires memory-aware systems capable of maintaining and updating query state or constructing semantic graphs of the conversation.
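
One simple way to realize such a memory-aware system is an explicit query-state object that each turn updates. The sketch below assumes an upstream NLU step has already parsed each utterance into a state delta; all field names and values are hypothetical:

```python
# Evolving query state for the three-turn dialogue above.
query_state = {"table": None, "metrics": [], "group_by": [], "filters": []}

def apply_turn(state, update):
    """Merge one turn's parsed intent into the evolving query state."""
    for key, value in update.items():
        if isinstance(state[key], list):
            state[key].extend(value)
        else:
            state[key] = value
    return state

apply_turn(query_state, {"table": "sales", "metrics": ["SUM(amount)"],
                         "filters": ["quarter = 'Q1'"]})            # "Show me sales for Q1."
apply_turn(query_state, {"group_by": ["region"]})                   # "Now break that down by region."
apply_turn(query_state, {"filters": ["return_rate <= 0.10"]})       # "Exclude returns over 10%."

sql = "SELECT {cols} FROM {table} WHERE {where} GROUP BY {group}".format(
    cols=", ".join(query_state["group_by"] + query_state["metrics"]),
    table=query_state["table"],
    where=" AND ".join(query_state["filters"]),
    group=", ".join(query_state["group_by"]),
)
print(sql)
```

Each follow-up only patches the state, so the final SQL accumulates the full conversational context.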

  • Evaluation and feedback loops: Evaluating the correctness of text-to-SQL systems is itself a complex task. Execution accuracy (i.e., whether the query returns the correct result) is often preferred over exact string match because multiple SQL formulations can yield the same output. However, execution-based metrics require a live database or simulation environment.

    Moreover, building feedback loops from user corrections, errors, or approval signals remains an open research area. Incorporating reinforcement learning from human feedback (RLHF), confidence scoring, and fallback mechanisms can help improve reliability but introduce further design complexity.

  • Data privacy and governance: In enterprise environments, access to data via SQL must adhere to strict governance policies. Text-to-SQL systems must be designed to:
    • Respect row-level and column-level access restrictions.
    • Mask sensitive fields (e.g., personally identifiable information (PII), financial data).
    • Log and audit generated queries for compliance.

    Failure to do so could result in data leaks, audit failures, or regulatory violations. This adds another layer of responsibility to system design, beyond just model accuracy.

  • User intent disambiguation: Understanding what the user truly wants often requires pragmatic inference beyond surface semantics. A query like show me the best customers last year leaves open questions:
    • How is best defined? By revenue, order frequency, or retention?
    • Should the model default to one metric or ask for clarification?

    Intent disambiguation strategies include the following:

      • Query clarification dialogues.
      • Multiple choice disambiguation prompts.
      • User-defined defaults or profiles.

    These strategies must balance user experience (UX) (keeping interactions efficient) with interpretability and correctness.
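
A minimal sketch of the clarification-dialogue strategy, assuming a small hand-curated vocabulary of ambiguous metric terms (the vocabulary is illustrative, not exhaustive):

```python
# Map ambiguous metric words to the candidate interpretations to offer.
AMBIGUOUS_METRICS = {
    "best": ["revenue", "order frequency", "retention"],
    "top-performing": ["revenue", "profit margin", "customer satisfaction"],
}

def disambiguate(question):
    """Return a clarification question if the query uses an ambiguous metric, else None."""
    for term, options in AMBIGUOUS_METRICS.items():
        if term in question.lower():
            return f"By '{term}', do you mean by {', '.join(options[:-1])}, or {options[-1]}?"
    return None  # unambiguous enough to proceed to SQL generation

print(disambiguate("show me the best customers last year"))
```

A production system would detect ambiguity with the model itself (or a confidence score) rather than a keyword list, but the control flow of ask-before-generating is the same.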

While text-to-SQL represents a promising interface between human language and structured databases, its implementation in real-world settings is constrained by a host of technical, linguistic, and organizational challenges. From resolving natural language ambiguity to ensuring SQL safety and execution correctness, the path from user question to executable query is fraught with potential failure points. Addressing these challenges requires advances in model design, schema representation, domain adaptation, and user-centered interaction design. Only through a holistic approach that blends AI, data engineering, and UX considerations can robust and trustworthy text-to-SQL systems be developed.

Practical guidance on designing a text-to-SQL system

Implementing a robust text-to-SQL system using modern language models requires careful orchestration of multiple components, ranging from prompt design and schema integration to output validation and system monitoring. While LLMs like GPT-4 have dramatically improved the feasibility of natural language interfaces for databases, their raw outputs must be carefully controlled, conditioned, and evaluated to ensure both correctness and safety in real-world settings. This section provides a comprehensive, step-by-step guide for implementing such systems, with an emphasis on pragmatic strategies grounded in current industry practices.

The following figure illustrates a high-level architecture of a modern text-to-SQL system, highlighting the critical stages from language model prompting to SQL validation and observability. It captures key components such as schema integration, user clarification, fallback mechanisms, multimodal extensions, and feedback loops essential for building robust and reliable natural language to SQL interfaces.

Flowchart showing the steps in the AI pipeline: strategy, data summarization, schema integration, planning, compliance, query clarification, retry, model deployment, SQL validation, and observability with feedback.

Figure 14.1: End-to-end text-to-SQL pipeline

The following is an explanation of Figure 14.1:

1. Prompting strategies and language model conditioning: Prompt engineering is a critical part of text-to-SQL implementation. Since LLMs operate within a zero-shot or few-shot paradigm, carefully constructed prompts can significantly influence their ability to translate natural language to correct SQL.

Zero-shot prompting: This approach assumes the model has been pre-trained on SQL patterns. A basic prompt might simply present the user query and database schema, followed by the instruction: Generate the corresponding SQL query.

Example:

o Input: List customers who placed more than 5 orders last month.

o Schema: Customers (id, name), Orders (id, customer_id, order_date)

The model must infer the correct join and time filter from context alone.

Few-shot prompting: Few-shot prompting includes 1-5 manually curated examples in the prompt to illustrate mappings between questions and SQL. This method improves accuracy, especially for complex queries, and allows injection of domain-specific idioms or business rules.
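
For the running Customers/Orders example, a few-shot prompt might be assembled like this (the schema string and example pairs are illustrative):

```python
# Schema and curated question-SQL pairs injected ahead of the user question.
SCHEMA = "Customers(id, name), Orders(id, customer_id, order_date)"

EXAMPLES = [
    ("How many customers do we have?", "SELECT COUNT(*) FROM Customers;"),
    ("List orders placed in 2024.",
     "SELECT * FROM Orders WHERE strftime('%Y', order_date) = '2024';"),
]

def build_prompt(question):
    """Compose schema context, few-shot pairs, and the new question into one prompt."""
    shots = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in EXAMPLES)
    return f"Schema: {SCHEMA}\n\n{shots}\n\nQ: {question}\nSQL:"

print(build_prompt("List customers who placed more than 5 orders last month."))
```

The trailing `SQL:` cue constrains the model to complete only the query, and the examples establish the expected dialect and formatting conventions.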

Chain of thought (CoT) prompting: For very complex queries, one may use intermediate reasoning steps in the prompt. For instance: first identify relevant tables, then define filters, then compose joins.

CoT also enables a modular or agentic decomposition approach, particularly useful in enterprise settings with complex schemas.

2. Schema and metadata integration: LLMs do not inherently know the schema of a specific database or its metadata unless it is explicitly provided. To bridge this gap, the schema and metadata must be embedded into the prompt or passed as context.

a. Flat schema listing: Tables and columns are simply listed before the prompt. This is effective for small or moderately sized databases.

b. Structured schema encoding: For larger schemas, especially with multiple foreign keys and nested joins, structured representation formats like JSON, annotated schema graphs, or entity-relationship summaries can be more effective.

c. Semantic schema mapping: Advanced implementations use embedding-based semantic matching to relate user query terms with schema labels, identifying synonyms, acronyms, and implicit references. For instance, mapping staff to employee or revenue to sales_amount.

3. Intermediate planning and decomposition: Some implementations benefit from intermediate planning stages. Instead of generating SQL directly, the system might:

a. First generate a natural language plan (e.g., we need to join Orders and Customers, filter by order_date, group by customer_id).

b. Then transform that plan into SQL.

c. This decomposition allows for validation at each stage and makes debugging easier.

4. Row and table summarization: Row summarization refers to generating a textual description of a specific row or record in a table. This often involves identifying key values, relationships, or anomalies and expressing them in fluent natural language. Table summarization focuses on producing concise narratives or insights about an entire dataset, such as trends, aggregates, distributions, or outliers across multiple rows and columns.

a. Row summarization workflow:

系统首先解读用户的自然语言提示,并识别目标行(通过 SQL 过滤或查找)。然后,摘要模块(基于规则或由语言学习模型驱动)使用选定的字段和值生成叙述性文本。

The system first interprets the user’s natural language prompt and identifies the target row (via SQL filtering or lookup). Then, a summarization module, either rule-based or powered by LLMs, generates a narrative using selected fields and values.

i. Input prompt: Summarize the top-selling product in April.

ii. Row output: {Product: 'Smartwatch X', Sales: 15,300, Region: 'North America'}.

iii. Summary: Smartwatch X was the best-selling product in April, with 15,300 units sold in North America.
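
A rule-based version of this row summary is little more than a template fill over the selected fields; an LLM-backed module would paraphrase the same values into freer prose:

```python
# Row returned by the SQL lookup for the example above (illustrative data).
row = {"Product": "Smartwatch X", "Sales": 15300, "Region": "North America"}

# Template-based narration; the :, format spec inserts thousands separators.
summary = (f"{row['Product']} was the best-selling product in April, "
           f"with {row['Sales']:,} units sold in {row['Region']}.")
print(summary)
```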

b. Table summarization workflow:

After executing a SQL query that returns multiple rows, the system identifies key metrics (averages, trends, modes, and anomalies) and generates a summary.

i. Input prompt: Give me a summary of quarterly sales.

ii. Summary: Sales increased steadily over the quarters, peaking in Q4 with $3.2M in revenue. The North region consistently outperformed other regions.
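
The table summary can likewise be grounded in computed aggregates before any LLM phrasing, which keeps the narrative faithful to the data (the rows below are illustrative):

```python
# Query result: (quarter, revenue in $M), illustrative data.
rows = [("Q1", 1.1), ("Q2", 1.8), ("Q3", 2.4), ("Q4", 3.2)]

# Derive the key metrics the narrative will mention.
peak_quarter, peak_value = max(rows, key=lambda r: r[1])
trend = "increased" if rows[-1][1] > rows[0][1] else "decreased"

summary = (f"Sales {trend} over the quarters, peaking in {peak_quarter} "
           f"with ${peak_value}M in revenue.")
print(summary)
```

Computing trends and peaks in code rather than asking the model to infer them prevents numerical hallucination in the generated summary.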

5. SQL output validation and safety: SQL generated by LLMs can be syntactically or logically incorrect. It is important to validate generated queries before execution.

Static analysis: Apply a SQL parser to check for syntax correctness. Tools like SQLparse or dialect-specific validators can catch basic errors.

Schema-aware validation: Cross-check whether referenced tables and columns exist in the target database schema.

Logical validation: Some systems implement test queries on a sample database or restrict execution to read-only views to prevent unintended side effects.
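
One inexpensive way to combine the syntax and schema-aware checks is to compile the generated SQL with EXPLAIN against an empty shadow database that mirrors only the production schema; this catches both syntax errors and references to missing tables or columns without touching real data. Shown here with SQLite, and the schema is illustrative:

```python
import sqlite3

# Shadow database: production schema, no data.
shadow = sqlite3.connect(":memory:")
shadow.execute("CREATE TABLE Customers (id INTEGER, name TEXT)")
shadow.execute("CREATE TABLE Orders (id INTEGER, customer_id INTEGER, order_date TEXT)")

def validate_sql(sql):
    """Compile (but do not run) the query; return (ok, error_message)."""
    try:
        shadow.execute("EXPLAIN " + sql)
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)

print(validate_sql("SELECT name FROM Customers"))    # (True, None)
print(validate_sql("SELECT salary FROM Customers"))  # rejected: no such column
print(validate_sql("SELEC name FROM Customers"))     # rejected: syntax error
```

For other engines, an equivalent trick is preparing the statement (or running EXPLAIN) inside a transaction that is always rolled back.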

6. Multimodal and tool-augmented extensions: Recent systems explore hybrid architectures where the LLM interfaces with external tools or databases through function-calling APIs or plugins. For instance, an LLM might call a get_table_info tool to dynamically retrieve schema metadata or use a vector search module to resolve ambiguous column references. These tool-augmented LLMs blur the line between static language models and interactive agents.

Moreover, multimodal extensions may incorporate tables, charts, or visual dashboards as output formats. While still emerging, architectures that combine text input with visual output (text-to-SQL-to-visualization) are gaining traction in BI settings.

7. System integration considerations: Architectural decisions must also consider latency, scalability, and deployment environment. Some systems are cloud-based with real-time API calls to models like OpenAI’s Codex or Anthropic’s Claude. Others run locally with open-source models like LLM Meta AI (Llama) or Falcon, offering better control and privacy. Caching frequently used query results and modularizing the system components ensures both performance and maintainability.

8. User interaction and query clarification: Since many queries are ambiguous, the system should support clarification prompts. If multiple SQL interpretations are possible, present choices to the user:

a. Did you mean top customers by revenue or by number of orders? This prevents wrong assumptions and builds trust.

9. Governance and compliance controls: In enterprise settings, ensure the system:

a. Redacts or masks sensitive fields in generated SQL.

b. Enforces row-level access restrictions.

c. Validates user credentials and access scope.

Integrating with existing identity and access management (IAM) systems ensures responsible deployment.

10. Fallback and retry mechanisms in text-to-SQL systems: As text-to-SQL systems evolve to support natural language access to structured databases, they must deal with a range of ambiguities, errors, and unpredictable user inputs. To ensure reliability and resilience, modern systems implement fallback and retry strategies, essential components for maintaining usability, trust, and accuracy. These mechanisms deal with the following:

a. Ambiguous queries (e.g., How many leads closed last quarter? When closed is not clearly defined)

b. Schema mismatches (e.g., using revenue when the table column is total_sales)

c. Model hallucination (e.g., referencing non-existent tables or columns)

d. Execution errors (e.g., SQL syntax errors, timeouts, or permission issues)

Fallback and retry strategies in text-to-SQL systems can take several forms, each designed to improve reliability and UX when the initial query fails. One approach is natural language clarification, where the system detects ambiguity or missing context and responds with a follow-up question, for example, did you mean total revenue or net profit? This encourages a conversational loop that helps disambiguate intent. Another method is retry with prompt refinement, where the system automatically adjusts the prompt using more specific templates, schema hints, or fine-tuned few-shot examples to regenerate valid SQL, typically without exposing this process to the user. In cases where the input is too complex, the system may perform a fallback to a simplified query, reformulating the request into a more basic version that still yields useful insights; for instance, turning a complex query about revenue growth for top SKUs into a simpler show average revenue by SKU. When the system cannot interpret the prompt at all, it may fallback to search or documentation, redirecting the user to dashboards, saved queries, or relevant schema references. Another strategy involves using default query templates for common requests, such as top 10 customers or monthly trend, especially when the intent is clear but exact mapping fails.

A robust retry engine might follow this logic:

try:
    sql = generate_sql(natural_query)
    result = execute_sql(sql)
except SQLValidationError:
    sql = regenerate_with_schema_guidance(natural_query)
    result = execute_sql(sql)
except TableOrColumnNotFound:
    sql = retry_with_synonym_mapping(natural_query)
    result = execute_sql(sql)
except Exception:
    return "Sorry, I couldn’t find what you’re asking for. Could you rephrase?"

11. Deployment of models: Implementation can follow different deployment strategies, which are as follows:

a. Embedded LLM API: Using external APIs like OpenAI’s GPT or Azure OpenAI with schema-aware prompts.

b. Self-hosted model: Fine-tuned smaller models (e.g., SQLCoder) deployed on local servers.

c. Hybrid agentic systems: LangChain-based orchestration with separate tools for parsing, validation, and reranking.

Choice of deployment depends on latency requirements, data security policies, and cost considerations.

Implementing a text-to-SQL system using LLMs requires far more than calling an API with a user prompt. It involves thoughtful integration of schema context, careful prompt construction, robust query validation, and user interaction design. When implemented well, such systems can transform how users engage with data—making structured databases accessible to non-technical users and accelerating insight generation across domains. However, reliability and safety must remain core priorities in any practical deployment.

12. Observability: As text-to-SQL systems grow in complexity, observability becomes critical for ensuring reliability, transparency, and continuous improvement. Observability refers to the ability to monitor internal states through externally measurable outputs such as query logs, model confidence scores, failure patterns, and latency metrics. Academic and production-grade systems alike benefit from instrumenting each stage of the text-to-SQL pipeline—from natural language parsing to SQL generation and execution—with detailed telemetry. This facilitates error diagnosis, user behavior analysis, prompt optimization, and safe rollback during model updates, ultimately supporting system accountability and responsible AI practices.

13. Feedback loop and correction interface: Finally, in enterprise-grade deployments, human-in-the-loop (HITL) escalation ensures that unresolved queries are routed to data analysts, who can provide responses and contribute to improving the system through feedback and training data. These layered fallback strategies make text-to-SQL systems more resilient, adaptive, and user-friendly. Users should be able to flag incorrect outputs and suggest corrections. Capturing this data enables retraining, fine-tuning, or rule updates.

a. Correction interface: Allow users to edit generated SQL or select from ranked alternatives. Use this input to adjust future prompt templates or schema mappings.

b. Logging and analytics: Track model confidence, failure reasons, and common query patterns. Over time, this supports system refinement and identifies training gaps.

Note: Execution environment setup: A reliable execution environment is essential for safe query evaluation:

  • Run queries against a test sandbox database before production
  • Use read-only replicas to prevent write or update operations
  • Impose query timeout limits and resource caps
  • Enable logging of all queries for audit and error tracking

Building on the step-by-step guide for implementing a text-to-SQL system, which typically includes components such as schema ingestion, prompt engineering, SQL decoding, and query validation, the next logical focus is on entity extraction. Entity extraction acts as the semantic bridge between unstructured user queries and structured database elements, enabling the system to ground natural language input in the schema vocabulary. Whether as a standalone module or as part of an agent-based orchestration workflow, robust entity extraction enhances interpretability, modularity, and SQL accuracy, laying the foundation for more reliable downstream query generation.

Entity extraction using LLM and text-to-SQL system

This pipeline implements an end-to-end NLP workflow that transforms unstructured product review text into structured tabular data. It achieves this by extracting named entities such as customer names and purchase dates using a local LLM, combining them with existing tabular records, and storing the merged results in an in-memory SQL database for subsequent querying.

The system is modular and agentic in design, leveraging the LangGraph library to define stateful graph transitions and Ollama for LLM inference. This pattern supports scalable, interpretable workflows for enterprise data integration and question answering (QA).

The following figure visually represents the LangGraph-based workflow for a text-to-SQL preprocessing pipeline. It captures the conditional execution logic between multiple agent nodes responsible for parsing column semantics, generating LLM-based chains, extracting structured data, merging datasets, and populating an SQL-accessible database. The flow begins with a conditional entry point based on the availability of column descriptions and proceeds through a deterministic path of entity extraction and data consolidation. A retry branch ensures robustness, while the final decision node allows the system to either populate the database or terminate gracefully. This modular design enables interpretable, state-aware orchestration of NLP tasks.

Flowchart showing a workflow that begins at execute_column_name_agent (with a retry loop) or execute_chain_creation_agent, then proceeds through execute_entity_extraction_agent, execute_data_combination_agent, and execute_database_agent before ending.

Figure 14.2: LangGraph workflow for entity extraction using text-to-SQL

The complete end-to-end code is provided in the GitHub repository, where you can find and understand various architectural approaches to experiment with.

Architecture overview

The implementation is architected as a multi-agent system using the LangGraph framework. The workflow is composed of five main agents, each encapsulating a discrete function, which are as follows:

  • ColumnNameAgent: It parses and structures user-defined column descriptions.
  • ChainCreationAgent: It creates an LLM-based extraction chain using structured prompts and a JSON parser.
  • EntityExtractionAgent: It applies the LLM pipeline to extract data from natural language reviews.
  • DataCombinationAgent: It merges structured and extracted data using pandas.
  • DatabaseAgent: It converts merged data into an SQL-accessible format via SQLite and SQLAlchemy.

A graph-based control flow governs the transitions between these agents, supporting conditional branching, retry logic, and full control over execution sequencing.

The following steps outline a detailed walkthrough of the code:

1. Imports and environment setup: To begin, all necessary libraries are imported, including LangGraph for workflow orchestration, Ollama for local model interaction, and pandas/SQLAlchemy for data processing and storage, as shown in the following code:

import ollama, langgraph

from langchain_ollama import ChatOllama

from sqlalchemy import create_engine

from sqlalchemy.pool import StaticPool

The environment combines Ollama for local LLM inference and LangGraph for declarative workflow definition. The ChatOllama wrapper interfaces with the model llama3.2:3b-instruct-fp16, which serves as the named entity recognition (NER) engine.

2. Defining the shared graph state: The pipeline uses a mutable, typed graph state to pass structured data and artifacts between agents. This centralized state design supports modular, state-aware transitions, as shown in the following code:

class GraphState(TypedDict):

question: str

...

The GraphState type defines the shape of the shared mutable state. It includes metadata (like the user question), input schema, chain objects, and intermediate outputs. This design adheres to functional programming principles while enabling state mutation across agent transitions.
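
The shape of such a state can be sketched with the standard library alone; the field names below are illustrative assumptions rather than the exact schema used in the repository:

```python
from typing import Any, TypedDict

# A minimal sketch of the shared state described above. The field names
# (column_name_str, extracted_data, merged_data, retry_count) are
# illustrative assumptions, not necessarily the book's exact schema.
class GraphState(TypedDict, total=False):
    question: str          # the user's natural-language question
    column_name_str: str   # raw user-defined column descriptions
    column_names: dict     # parsed label -> semantic description
    extracted_data: list   # row-wise dicts produced by the LLM chain
    merged_data: Any       # the combined dataset
    retry_count: int       # consumed by the retry branch

# Agents read from and write to the same dictionary as it flows through the graph.
state: GraphState = {"question": "Which customers churned?", "retry_count": 0}
state["column_names"] = {"Name": "<Name of the customer>"}
print(sorted(state))  # ['column_names', 'question', 'retry_count']
```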

3. ColumnNameAgent: This agent parses and structures the user-defined column descriptions, producing the dictionary that guides downstream extraction. The following is the core of the parsing logic:

class ColumnNameAgent:

def run(self, state):

...

This agent parses the user-defined column_name_str into a structured dictionary column_names. Each entry maps a raw column label to a semantic description (e.g., "Name": "<Name of the customer>"). These tags guide the LLM in downstream extraction.
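
The parsing step can be sketched as follows; the semicolon-separated "Label: <description>" input format is an assumption for illustration, as the exact format of column_name_str is defined in the repository code:

```python
# Hypothetical input format: "Label: <description>" pairs separated by
# semicolons. The real column_name_str format may differ.
def parse_column_names(column_name_str: str) -> dict:
    column_names = {}
    for pair in column_name_str.split(";"):
        if ":" not in pair:
            continue  # skip malformed fragments
        label, description = pair.split(":", 1)
        column_names[label.strip()] = description.strip()
    return column_names

parsed = parse_column_names(
    "Name: <Name of the customer>; Sentiment: <Overall sentiment of the review>"
)
print(parsed)
# {'Name': '<Name of the customer>', 'Sentiment': '<Overall sentiment of the review>'}
```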

4. ChainCreationAgent: This agent constructs a LangChain pipeline, connecting a prompt, a local LLM, and a JSON parser. The following is the core of the entity recognition process:

class ChainCreationAgent:

def run(self, state):

...

a. The LLM is configured with a PromptTemplate instructing it to perform named entity recognition. The prompt is crafted in a role-specific tone and demands structured output:

template = """You need to act as a Named Entity Recognizer.

b. Extract the following column names from the review text:

{column_names}

...

STRICTLY respond in JSON format like: {"column_1": "<value 1>", ... }

"""

This agent initializes a chain, linking the prompt | LLM | JSON parser.
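
The prompt | LLM | parser composition can be illustrated with a dependency-free sketch, where fake_llm stands in for ChatOllama and json.loads plays the role of LangChain's JSON output parser:

```python
import json

# Dependency-free sketch of the prompt | LLM | JSON-parser chain.
# fake_llm is a stub for the real model call.
PROMPT = (
    "You need to act as a Named Entity Recognizer.\n"
    "Extract the following column names from the review text:\n{column_names}\n"
    'STRICTLY respond in JSON format like: {{"column_1": "<value 1>", ... }}\n'
    "Review: {review}"
)

def fake_llm(prompt: str) -> str:
    # Stub: a real model would read the review; here we return a fixed answer.
    return '{"Name": "Alice", "Sentiment": "positive"}'

def chain(review: str, column_names: dict) -> dict:
    prompt = PROMPT.format(column_names=json.dumps(column_names), review=review)
    raw = fake_llm(prompt)   # LLM step
    return json.loads(raw)   # JSON-parser step

result = chain("Alice loved the product!", {"Name": "<customer>", "Sentiment": "<tone>"})
print(result)  # {'Name': 'Alice', 'Sentiment': 'positive'}
```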

5. EntityExtractionAgent: The model-powered chain is now invoked over a list of review texts, generating structured row-wise dictionaries of extracted values for downstream processing, as shown in the following code:

class EntityExtractionAgent:

def run(self, state):

...

The chain is applied iteratively over the ReviewText column in df2. Each LLM output is parsed and collected into extracted_data, a list of dictionaries representing structured row-level extractions. This agent essentially operationalizes the LLM as an entity extractor.
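
The iteration itself is straightforward; in this sketch, extract() is a stub standing in for the compiled LLM chain:

```python
# Sketch of the row-wise extraction loop; extract() stands in for the
# compiled LLM chain, which in the book is invoked per ReviewText row.
reviews = [
    "Alice loved the product!",
    "Bob found delivery slow.",
]

def extract(review: str) -> dict:
    # Stub extractor: a real chain would call the LLM; here we fake the
    # structured output by taking the first word as the customer name.
    return {"Name": review.split()[0], "ReviewText": review}

extracted_data = [extract(r) for r in reviews]  # list of row-level dicts
print([row["Name"] for row in extracted_data])  # ['Alice', 'Bob']
```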

6. DataCombinationAgent: The extracted fields are merged with the existing tabular data, aligning on key columns. The result is a fully structured dataset with both original and derived information, as shown in the following code:

class DataCombinationAgent:

def run(self, state):

...

a. This stage performs a join between:

i. The original structured table df1

ii. The newly extracted dataframe extracted_df

Join keys are inferred from the structured column definitions. The result is saved as merged_data and exported to disk as a CSV file.
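
The book performs this merge with pandas; the following dependency-free sketch shows the same left-join logic on an assumed Name key column:

```python
# Pure-Python sketch of the left join the book performs with pandas.
# The "Name" key column is an illustrative assumption.
df1_rows = [
    {"Name": "Alice", "City": "Pune"},
    {"Name": "Bob", "City": "Delhi"},
]
extracted_rows = [
    {"Name": "Alice", "Sentiment": "positive"},
    {"Name": "Bob", "Sentiment": "negative"},
]

by_key = {row["Name"]: row for row in extracted_rows}
merged_data = [
    {**left, **by_key.get(left["Name"], {})}  # left join on Name
    for left in df1_rows
]
print(merged_data[0])  # {'Name': 'Alice', 'City': 'Pune', 'Sentiment': 'positive'}
```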

7. DatabaseAgent: After combining the datasets, this agent writes the output to a memory-resident SQLite database, making it accessible via SQL queries:

class DatabaseAgent:

def run(self, state):

...

Here, the merged data is persisted in a transient SQLite database. SQLAlchemy is configured with a StaticPool to ensure the in-memory connection remains valid across sessions. This enables downstream LLMs or applications to perform SQL queries without requiring a full relational database management system (RDBMS).
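
The same in-memory pattern can be sketched with the standard-library sqlite3 driver (the repository code uses SQLAlchemy with a StaticPool instead):

```python
import sqlite3

# In-memory SQLite sketch of the DatabaseAgent step; the table name and
# columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, sentiment TEXT)")
merged_data = [("Alice", "positive"), ("Bob", "negative")]
conn.executemany("INSERT INTO customers VALUES (?, ?)", merged_data)

# Downstream consumers (an LLM agent, a dashboard) can now issue plain SQL.
rows = conn.execute(
    "SELECT name FROM customers WHERE sentiment = 'positive'"
).fetchall()
print(rows)  # [('Alice',)]
```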

8. Graph definition and workflow compilation: Each agent is added to the LangGraph as a node. Conditional edges define execution paths based on data availability and retry logic, as shown in the following code:

workflow = StateGraph(GraphState)

...

graph = workflow.compile()

Each agent is added as a node in the LangGraph, with conditional transitions between them. Entry point routing and retry logic are controlled by custom functions decide_entry_point and decide_next_step.
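
The two routing functions can be sketched as plain Python; in LangGraph they would be registered via conditional edges, and the state keys and retry budget here are illustrative assumptions:

```python
# Dependency-free sketch of the two routing decisions. The keys and the
# MAX_RETRIES budget are illustrative assumptions.
MAX_RETRIES = 3

def decide_entry_point(state: dict) -> str:
    # Start by parsing column descriptions when the user supplied them;
    # otherwise jump straight to chain creation.
    if state.get("column_name_str"):
        return "execute_column_name_agent"
    return "execute_chain_creation_agent"

def decide_next_step(state: dict) -> str:
    # Retry parsing until it succeeds or the retry budget is exhausted.
    if not state.get("column_names") and state.get("retry_count", 0) < MAX_RETRIES:
        return "execute_column_name_agent"   # retry branch
    return "execute_chain_creation_agent"

print(decide_entry_point({"column_name_str": "Name: <customer>"}))
# execute_column_name_agent
```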

9. Workflow execution: A custom runner simulates step-by-step execution of the graph. This loop handles node routing, transitions, and error resolution, as shown:

def process_workflow(state):

...

The process_workflow function executes the pipeline sequentially. This is a linearized version of LangGraph's graph traversal. It manually steps through each phase until the end is reached, logging output at each transition.
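
A minimal version of such a runner is a dispatch table stepped until an END sentinel; the node names follow Figure 14.2, while the agent bodies are stubs:

```python
# Minimal sketch of a linearized runner: node names map to callables and
# each callable returns the next node. The agent bodies are stubs.
END = "__end__"

def column_name_agent(state):
    state["column_names"] = {"Name": "<customer>"}
    return "execute_entity_extraction_agent"

def entity_extraction_agent(state):
    state["extracted_data"] = [{"Name": "Alice"}]
    return END

NODES = {
    "execute_column_name_agent": column_name_agent,
    "execute_entity_extraction_agent": entity_extraction_agent,
}

def process_workflow(state, entry="execute_column_name_agent"):
    node = entry
    while node != END:
        print(f"running {node}")        # log each transition
        node = NODES[node](state)
    return state

final = process_workflow({})
print(final["extracted_data"])  # [{'Name': 'Alice'}]
```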

10. Initialization and example run: Finally, sample data is used to initialize the state, and the full workflow is executed. Outputs include the final structured dataset and a live database engine for SQL access, as shown in the following code:

initial_state = GraphState(...)

a. Two toy DataFrames (df1 and df2) simulate a customer dataset and corresponding review texts. Upon execution, the final state includes the following:

i. Extracted entities

ii. Merged dataset

iii. Database engine

This setup facilitates downstream querying, visualization, or LLM-assisted analytics over the structured result.

This workflow demonstrates a composable, interpretable architecture for turning natural language data into SQL-ready form using local LLMs and graph-based orchestration. The modular agent design enhances explainability and error isolation, while LangGraph enables flexible control of flow logic and retries. Such systems are valuable in customer support automation, e-commerce analytics, and review summarization pipelines.

An end-to-end multi-database agentic implementation is available in Chapter 15, Agentic Text-to-SQL Systems and Architecture Decision-Making.

Enhance data accessibility and literacy

In the modern data-centric economy, access to actionable information is critical for decision-making, innovation, and operational efficiency. Yet the vast majority of valuable data resides in structured relational databases that are often inaccessible to non-technical users. These users typically lack the expertise to write SQL queries, understand schema complexity, or navigate BI tools with steep learning curves. Text-to-SQL systems, which enable users to interact with structured data using natural language, are poised to transform this landscape by dramatically increasing data accessibility and promoting data literacy across organizational hierarchies.

The following list explores how text-to-SQL systems are transforming the landscape of data accessibility, enabling a more inclusive, agile, and data-literate workforce:

  • Bridging the technical divide: Traditionally, querying databases has been the domain of data analysts, database administrators, or software engineers. Business users, such as sales managers, product owners, and HR professionals, are typically forced to rely on these technical experts to extract insights from data. This creates bottlenecks and delays in insight generation, limiting responsiveness and innovation.

    Text-to-SQL systems remove this barrier by allowing users to express their queries in plain language. For example, instead of waiting for a data analyst to write a SQL query, a marketing manager could type, show me all leads from last month who converted into customers. This input is translated automatically into SQL, executed, and visualized instantly. The result is a faster feedback loop, empowering business users to make data-informed decisions independently.

  • Democratizing data in large organizations: As organizations scale, data becomes siloed both physically across departments and cognitively across knowledge domains. Different teams may use different terminologies or interpret metrics in unique ways. Text-to-SQL systems help to unify access by presenting a common natural language interface, customized to the enterprise’s vocabulary.

    By embedding domain-specific prompts and leveraging schema-aware prompting, these systems can accommodate multiple departments without requiring them to understand the underlying database structure. This democratization promotes transparency, cross-functional collaboration, and a shared understanding of key metrics.

  • Fostering a culture of data literacy: Data literacy refers to the ability to read, work with, analyze, and communicate with data. It is a vital skill in the digital economy, yet it remains underdeveloped in many organizations due to barriers in tooling and training.

    Text-to-SQL lowers the entry point for engaging with data. By allowing users to formulate and iterate on data questions in natural language, it builds an intuitive understanding of how data is structured and how it can answer business questions. Over time, users begin to develop mental models of the schema, understand joins and filters, and even improve their question formulation skills.

    Additionally, some educational platforms use text-to-SQL as a teaching tool. Learners can input questions in English and see how they map to SQL syntax. This interactive learning process supports comprehension and builds confidence in data exploration.

  • Empowering real-time decision-making: The speed of decision-making is often constrained by the availability of insights. In fast-paced industries, such as e-commerce, logistics, and finance, waiting for a data request to be fulfilled can result in lost opportunities. Text-to-SQL systems, especially when embedded in dashboards, chat interfaces, or mobile apps, allow frontline workers to obtain insights on demand.
    • For example:
      • A warehouse manager can ask, which items are running low in Zone 3?
      • A financial planner can ask, what was the YoY revenue change for Q3?
      • A healthcare provider can ask, how many patients with asthma visited in the last 7 days?

        These questions are transformed into executable queries and delivered instantly, reducing friction and enabling real-time decisions.

  • Enhancing data governance and traceability: When users manually write SQL, version control and access control become hard to enforce. Text-to-SQL systems centralize query generation, making it easier to enforce:
    • Role-based access to data.
    • Logging and auditing of all queries.
    • Consistent metrics definitions via templates.

    This increases trust in data, reduces the risk of misinterpretation, and supports compliance with internal policies or regulatory standards.

  • Closing the gap between curiosity and capability: One of the hidden costs in data workflows is the suppression of curiosity. When users know that it is too difficult or takes too long to get a data question answered, they stop asking. Text-to-SQL reactivates this curiosity by enabling fast iteration. Users can ask a follow-up question, rephrase, or drill down without needing to re-engage an analyst or engineer.

    This cultivates a more exploratory, insight-driven mindset across the organization, moving from static dashboards to dynamic querying.

  • Supporting inclusive and global access: In multilingual or accessibility-aware environments, text-to-SQL systems can support localized inputs or voice queries, broadening data access to a wider range of users. With proper training and interface design, even non-literate or visually impaired users could query databases using speech or translated queries.

    This positions text-to-SQL not only as a technical innovation but also as a key enabler of digital inclusion.

Text-to-SQL technology is more than a technical convenience; it is a strategic enabler of widespread data empowerment. By allowing users to ask questions in their own words and receive reliable answers grounded in structured data, these systems break down longstanding barriers between data and people. They enable self-service analytics, foster a culture of curiosity, and elevate the data literacy of an organization as a whole. In the coming years, the successful adoption of text-to-SQL systems may be a key differentiator for organizations that seek to be truly data-driven in both strategy and execution.

Performance metrics and best practices

Evaluating the performance of text-to-SQL systems is a complex and multifaceted task. These systems do not produce simple labels or continuous values; instead, they generate structured queries that must be both syntactically correct and semantically aligned with the user's intent. Moreover, there is often more than one correct way to express a query in SQL, which complicates evaluation further. This section introduces key evaluation metrics used in text-to-SQL research and practice, providing detailed definitions and guidance on their use, followed by best practices for real-world deployments.

Exact match accuracy

Exact match accuracy measures the percentage of generated SQL queries that match the reference (ground truth) queries exactly, including all elements such as clauses, table names, aliases, and formatting. This is a strict metric where even a minor variation (such as a different join order or use of an alias) is considered an error. It is typically applied in benchmark datasets like Spider, where ground truth is available and the task is framed as one-to-one mapping from natural language to SQL.
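
A toy scorer makes the strictness concrete: only whitespace is normalized, so an aliased but equivalent query still counts as a miss:

```python
# Toy exact-match scorer: whitespace is collapsed, but any other
# difference (aliases, clause order) still counts as an error.
def exact_match_accuracy(predictions, references):
    norm = lambda q: " ".join(q.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["SELECT name FROM users", "SELECT u.name FROM users u"]
refs  = ["SELECT name  FROM users", "SELECT name FROM users"]
print(exact_match_accuracy(preds, refs))  # 0.5  (the aliased query is penalized)
```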

The following list outlines the advantages, limitations, and use cases:

  • Advantages:
    • Simple to compute and interpret.
    • Enables direct comparison across models on shared benchmarks.
  • Limitations:
    • Over-penalizes valid but syntactically different queries.
    • Ignores semantic equivalence and result correctness.
    • Not suitable for open-domain or production systems with flexible schemas.
  • Use case: Primarily used in research benchmarks where standard SQL templates and fixed schemas are provided.

Execution accuracy

Execution accuracy evaluates whether the generated SQL, when executed on the target database, returns the same result as the reference SQL query. It directly compares the output of both queries and considers them equal if the result sets match, regardless of query structure.
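
The idea can be sketched against an in-memory SQLite table: two syntactically different queries count as a match when their (order-insensitive) result sets agree:

```python
import sqlite3

# Execution-accuracy sketch: the sample table and queries are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("west", 20), ("east", 5)])

def execution_match(generated_sql, reference_sql):
    # Compare result sets order-insensitively, ignoring query structure.
    got = sorted(conn.execute(generated_sql).fetchall())
    want = sorted(conn.execute(reference_sql).fetchall())
    return got == want

print(execution_match(
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "SELECT s.region, SUM(s.amount) FROM sales s GROUP BY s.region",
))  # True: different syntax, same results
```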

The following list outlines the advantages, limitations, and use cases:

  • Advantages:
    • Better aligns with user expectations (correct answers matter more than SQL syntax).
    • Tolerates syntactic variation and aliasing.
    • Allows equivalence testing even in complex queries.
  • Limitations:
    • Requires access to a test or production database instance.
    • Sensitive to changes in data distribution or row ordering.
    • Cannot be applied to queries involving non-deterministic operations (e.g., random(), LIMIT without ORDER BY).
  • Use case: Preferred in practical systems and production evaluations where the goal is to ensure the user receives the correct information.

Component-level accuracy

This metric decomposes the SQL query into structural components such as SELECT, WHERE, GROUP BY, ORDER BY, HAVING, and JOIN clauses. It measures how many of these components are correctly predicted relative to the reference query.
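
Assuming queries have already been parsed into clause dictionaries (a real system would use a SQL parser for that step), the score is simply the fraction of matching components:

```python
# Component-level scoring sketch. Queries are assumed pre-parsed into
# clause dictionaries; the component list here is illustrative.
def component_accuracy(pred: dict, ref: dict) -> float:
    components = ("select", "where", "group_by", "order_by")
    hits = sum(pred.get(c) == ref.get(c) for c in components)
    return hits / len(components)

ref  = {"select": ["name"], "where": ["age > 30"], "group_by": [], "order_by": []}
pred = {"select": ["name"], "where": ["age > 18"], "group_by": [], "order_by": []}
print(component_accuracy(pred, ref))  # 0.75  (only the WHERE clause differs)
```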

The following list outlines the advantages, limitations, and use cases:

  • Advantages:
    • Provides granular insights into system performance.
    • Useful for diagnosing which parts of the query generation pipeline need improvement.
    • Can be weighted by importance or frequency.
  • Limitations:
    • Requires a structured SQL parse tree.
    • Not suitable as a sole metric, must be used alongside others.
  • Use case: Best for debugging models, instructional tools, or tracking improvement during iterative development.

Query execution success rate

This metric measures the percentage of generated SQL queries that can be executed successfully on the database without triggering syntax or runtime errors. It does not assess the correctness of results, only whether the query can run.
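
A sketch of the metric: a query "succeeds" if it parses and runs, even when its logic is wrong, so only syntax and runtime errors count against it:

```python
import sqlite3

# Success-rate sketch over an illustrative one-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

def execution_success_rate(queries):
    ok = 0
    for q in queries:
        try:
            conn.execute(q)
            ok += 1
        except sqlite3.Error:
            pass  # syntax or runtime error counts as a failure
    return ok / len(queries)

queries = [
    "SELECT name FROM users",       # runs
    "SELECT missing FROM users",    # runtime error: no such column
    "SELEC name FROM users",        # syntax error
]
print(execution_success_rate(queries))  # 0.333... (1 of 3 queries runs)
```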

The following list outlines the advantages, limitations, and use cases:

  • Advantages:
    • Indicates robustness and syntactic validity of the model output.
    • Useful for tracking production system health.
  • Limitations:
    • Ignores semantic errors (e.g., wrong logic, wrong filters).
    • May overstate quality if malformed but syntactically valid queries pass.
  • Use case: Used in production systems for continuous monitoring and safety checks.

Semantic equivalence and canonicalization

Semantic equivalence testing aims to determine whether two SQL queries are functionally identical despite differing syntactic forms. This often involves normalizing or canonicalizing the queries (e.g., removing aliases, reordering joins) before comparing them.

The following list outlines the advantages, limitations, and use cases:

  • Advantages:
    • Captures the true intent of the query.
    • Handles flexible query structures and expressions.
  • Limitations:
    • Requires sophisticated SQL parsers and semantic analyzers.
    • May produce false positives or negatives in edge cases.
  • Use case: Recommended in advanced evaluations where execution testing is impractical or multiple valid outputs are expected.

Human evaluation

Human evaluation involves expert reviewers assessing the quality of the generated SQL based on criteria such as correctness, clarity, relevance, and efficiency. Reviewers may manually execute queries or inspect their logic against the schema.

The following list outlines the advantages, limitations, and use cases:

  • Advantages:
    • Provides nuanced, context-sensitive assessment.
    • Can catch subtle semantic issues or edge cases.
  • Limitations:
    • Expensive and time-consuming.
    • Subjective unless standardized with clear rubrics.
  • Use case: Ideal for pilot deployments, user-facing evaluations, or resolving ambiguous cases.

Latency and throughput metrics

These operational metrics measure the time required to generate a SQL query (latency) and the number of queries that can be processed in a given time period (throughput). They indicate system responsiveness and scalability.
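
Both quantities can be measured with a simple timing harness; generate_sql below is a placeholder for the actual model call:

```python
import time

# Timing sketch around a stub generator; a real measurement would wrap
# the model call instead of the placeholder below.
def generate_sql(question: str) -> str:
    return f"SELECT * FROM t /* {question} */"   # placeholder for the model

questions = ["q1", "q2", "q3"]
start = time.perf_counter()
latencies = []
for q in questions:
    t0 = time.perf_counter()
    generate_sql(q)
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

avg_latency = sum(latencies) / len(latencies)        # seconds per query
throughput = len(questions) / max(elapsed, 1e-9)     # queries per second
print(f"avg latency {avg_latency:.6f}s, throughput {throughput:.0f} qps")
```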

The following list outlines the advantages, limitations, and use cases:

  • Advantages:
    • Useful for UX optimization.
    • Helps identify bottlenecks in processing pipelines.
  • Limitations:
    • Influenced by hardware, caching, model size, and database configuration.
    • Not related to query accuracy.
  • Use case: Used in production systems to ensure performance service level agreements (SLAs) are met.

Best practices for performance evaluation

Evaluating text-to-SQL systems requires more than checking for correct outputs—it demands a structured approach that balances quantitative metrics, representative datasets, and continuous improvement. The following practices help ensure meaningful, reliable performance assessment:

  • Use multiple metrics in combination: No single metric captures all aspects of quality. Combine execution accuracy, exact match, and component accuracy to achieve a comprehensive evaluation.
  • Construct representative evaluation sets: Ensure test data includes varied query types:
    • Simple vs. nested
    • Single table vs. multi-join
    • Domain-specific terminology

This ensures robust generalization.

  • Establish ground truth carefully: For custom datasets, manual SQL annotation must be validated for correctness and consistency. Include comments or natural language paraphrases to assist evaluation.
  • Track model failure modes: Categorize errors as follows:
    • Schema mismatch
    • Incorrect aggregation
    • Logical inconsistency
    • Ambiguous interpretation

    Analyzing these patterns helps in prompt refinement and model tuning.

  • Deploy continuous evaluation loops: In production environments, implement pipelines to monitor performance over time, including:
    • Drift detection (as schema or query types evolve)
    • Error tracking and regression testing
    • User feedback collection for fine-tuning

Measuring the performance of text-to-SQL systems demands more than accuracy alone. A holistic evaluation framework must incorporate syntactic, semantic, and operational metrics. From exact matches and execution correctness to latency and human feedback, these metrics provide the foundation for model improvement, deployment readiness, and user trust. As these systems continue to evolve, standardizing evaluation methodologies will be essential for benchmarking progress, ensuring fairness, and guiding practical adoption at scale.

Conclusion

This chapter has provided a comprehensive introduction to the foundational components of text-to-SQL systems, bridging the gap between natural language queries and structured data access. We began by exploring the basic concepts underlying text-to-SQL, including schema linking, SQL generation, and the role of LLMs in interpreting ambiguous user intent. We then examined system architecture patterns, ranging from simple prompting strategies to agent-based execution graphs. Real-world applications demonstrated how text-to-SQL can empower users across domains such as BI, healthcare, education, and finance. We also analyzed the technical challenges associated with schema alignment, validation, and deployment, followed by best practices for implementation and performance evaluation. Together, these insights offer a blueprint for designing reliable, scalable, and user-centric text-to-SQL solutions.

Text-to-SQL is not merely a technical innovation; it represents a fundamental shift in how individuals interact with data. By lowering the barrier to querying relational databases, it promotes organizational data literacy and accelerates decision-making across roles and functions.

The next chapter will introduce an advanced, agentic multi-query text-to-SQL system. We will explore how LLM-powered agents can collaborate to handle multi-turn dialogues, join reasoning, and query decomposition, enabling robust and explainable data retrieval in complex, real-world environments.

CHAPTER 15 Agentic Text-to-SQL Systems and Architecture Decision-Making

Introduction

In this chapter, we pick up where the last chapter left off. Agentic text-to-Structured Query Language (SQL) systems represent a significant evolution in how humans interact with structured data. Rather than relying on static rules or pre-defined templates, these systems use autonomous agents, powered by large language models (LLMs), retrieval mechanisms, and reasoning frameworks like LangChain, to dynamically translate natural language questions into executable SQL queries. This chapter explores the architecture and decision-making strategies required to design such intelligent systems.

At the core of these architectures lies a multi-step orchestration process involving query embedding, semantic search, schema matching, SQL generation, and federated execution. Each component, from Sentence Transformer-based embeddings to LangChain’s ReAct agent with chain-of-thought (CoT) prompting, plays a crucial role in maintaining accuracy, adaptability, and transparency. The use of global indexes, schema matchers, and pre-filtering ensures that the agent can handle cross-database queries with minimal hallucination or ambiguity.

This chapter breaks down the full pipeline shown in the architectural diagram and explains key design choices, including when to use federated query engines (FQEs), how to implement index-aware retrieval, and how to score and rerank SQL outputs using LLM-based evaluators. By the end, readers will gain a structured blueprint for implementing scalable, reliable, and interpretable agentic text-to-SQL systems tailored to real-world enterprise needs.

Structure

In this chapter, we will learn about the following topics:

  • Agentic text-to-SQL system for real-time retail intelligence
  • Architecture and code explanation of text-to-SQL system
  • Step-by-step pipeline explanation
  • Output from the text-to-SQL system
  • Solution to initial problem statement

Objectives

The objective of this chapter is to present a modular and scalable framework for building agentic text-to-SQL systems that enable natural language querying across distributed, structured databases. By combining LLMs with planning agents, schema-aware tool use, and semantic indexing, the system intelligently translates user queries into executable SQL. This chapter outlines the architecture, implementation, and design trade-offs involved in developing such systems using LangChain's ReAct agent, global index lookup, and federated SQL execution. The goal is to empower practitioners to build robust, context-aware SQL agents capable of adaptive reasoning and accurate multi-database query execution.

Agentic text-to-SQL system for real-time retail intelligence

Modern retail businesses generate massive data across distributed databases, customer profiles in PostgreSQL, orders in MySQL, marketing logs in MongoDB, and inventory in separate systems. These data silos hinder fast, data-driven decisions, especially for non-technical users who struggle to query structured databases using SQL. Manual query writing by data teams leads to bottlenecks, delays, and lost agility.

Business challenge and problem statement

The retail enterprise seeks to enable real-time, natural language access to its sales, customer, and inventory data across multiple heterogeneous databases.

The following list outlines the current pain points:

  • Delayed decisions: Analysts and executives depend on data teams for SQL queries, slowing time-sensitive actions.
  • Siloed data: Information is fragmented across incompatible systems, requiring complex joins and schema understanding.
  • Query latency: Massive datasets (~245GB+) make traditional SQL querying inefficient.
  • Revenue impact: Without instant insight into customer churn, inventory gaps, or product demand spikes, sales opportunities are frequently missed.

So, our goal is to develop an agentic text-to-SQL system with schema-aware tool use, semantic indexing, and federated execution to enable intelligent, self-serve querying for business users, without needing SQL expertise.

Architecture and code explanation of text-to-SQL system

Figure 15.1 presents a high-level view of the complete text-to-SQL pipeline, illustrating the chronological flow of data and control from user query to SQL execution and final response. Each step in the figure is labeled and corresponds to a distinct system function:

The flowchart shows the user interacting with LangChain, which uses an embedding model and connects to a federated query engine, multiple databases, a global index, search tools, and a schema matcher for query execution, returning results to the user.

Figure 15.1: A very high-level workflow of the complete text-to-SQL pipeline

The architecture in Figure 15.2 illustrates an end-to-end agentic text-to-SQL system designed to bridge natural language queries with structured, multi-database environments. The pipeline showcases how user input is transformed into SQL through a series of intelligent steps: embedding generation, schema matching, semantic retrieval, and CoT-based SQL synthesis. At its core, the system leverages LangChain’s ReAct agent framework, integrating pre-filtering, LLM reasoning, SQL grading, and optional federated execution. This architecture enables real-time, schema-aware querying across siloed datasets, empowering business users to retrieve actionable insights without writing SQL, making enterprise analytics faster, more accessible, and highly contextual.

The flowchart depicts the federated query process: user input passes through a Sentence Transformer, schema matching, filtering, and a global index to access multiple databases. Results are generated using LangChain and an LLM and returned to the user.

Figure 15.2: Solution design of an agentic text-to-SQL solution

Step-by-step pipeline explanation

Figure 15.2 presents a high-level view of the complete text-to-SQL pipeline, illustrating the chronological flow of data and control from user query to SQL execution and final response. The following steps explain how each stage in the figure is labeled and corresponds to a distinct system function:

1. User input: The process begins when the user submits a natural language query (e.g., via Streamlit interface or application programming interface (API)). The query is passed into the backend pipeline for processing.

2. Query embedding generation: The input query is transformed into a vector representation using a pre-trained Sentence Transformer model. This embedding captures the semantic meaning of the query.

3. Global index: The query embedding is forwarded to the global index (e.g., implemented in ChromaDB) for similarity-based retrieval of relevant schema summaries or historical query patterns.

4. LangChain: LangChain is invoked to orchestrate the reasoning and tool usage for schema summarization, matching, and filtering tasks.

5. Schema matcher: Compares the query intent against available database schemas to ensure the selected tables and columns are semantically aligned with the user’s request.

6. Checks the global index: The matched schema is cross-referenced with the global index to validate and refine the selection for consistency and relevance across databases.

7. pre_filter(query_embedding): A pre-filtering function is applied to the query embedding to reduce the search space in the vector index, improving retrieval efficiency.

8. Semantic search on summarized data: Performs a semantic search over summarized schema/data using the pre-filtered embedding, helping select the most relevant content for SQL construction.

9. SQL query generation using LangChain ReAct agent and CoT prompt: Based on the retrieved schema and query intent, the LangChain agent generates an initial SQL query using a CoT prompting strategy for clarity and correctness.

10. SQL query generated: The SQL query is produced in structured executable form, with proper clauses (e.g., SELECT, WHERE) reflecting the semantic meaning of the user’s query.

11. SQL query is graded: The generated SQL is evaluated by an LLM to verify syntactic correctness and semantic alignment with the original question.

12. SQL query execution using LangChain ReAct agent, CoT prompt, and FQE (optional): Executes the final SQL query, optionally using an FQE to aggregate results from multiple databases:

a. Accesses the global index

b. SQL query executed across multiple databases

13. Response sent to LangChain: Execution results are sent back into the LangChain pipeline for post-processing and formatting.

14. Response to user: The final output, including the SQL query, retrieved data, and an optional summary, is sent back to the user via the user interface (UI). All computation is in-memory; no intermediate results are persisted.
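Steps 7 and 8 can be illustrated with a stdlib-only sketch. The three-dimensional embeddings, the similarity floor, and the index entries below are toy assumptions for demonstration only; the real system uses Sentence Transformer vectors stored in ChromaDB:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "global index": schema summaries with hypothetical 3-d embeddings.
INDEX = {
    "orders(order_id, customer_id, amount)":  [0.9, 0.1, 0.0],
    "customers(customer_id, name, region)":   [0.2, 0.9, 0.1],
    "inventory(sku, stock_level, warehouse)": [0.1, 0.2, 0.9],
}

def pre_filter(query_embedding, floor=0.3):
    # Step 7: shrink the search space before the more expensive ranking pass.
    return {s: v for s, v in INDEX.items() if cosine(query_embedding, v) >= floor}

def semantic_search(query_embedding, top_k=1):
    # Step 8: rank the surviving candidates by similarity to the query.
    candidates = pre_filter(query_embedding)
    ranked = sorted(candidates, key=lambda s: cosine(query_embedding, candidates[s]), reverse=True)
    return ranked[:top_k]

print(semantic_search([0.85, 0.15, 0.05]))  # the "orders" summary ranks first
```

Pre-filtering discards weak candidates cheaply, so the ranking pass only scores a reduced set; this is what keeps retrieval efficient at the data scale mentioned earlier.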

Folder structure

To operationalize the agentic text-to-SQL architecture, the system is modularly implemented with a clear separation of concerns across configuration, core logic, UI frontend, and task-specific modules. The following folder structure represents a scalable implementation using LangChain, Ollama, and ChromaDB, enabling both vector-based retrieval and multi-database SQL execution. Core components such as schema matching, SQL generation, summarization, and query grading are abstracted into reusable tasks. The global_index_db/ stores vector indexes, while the frontend handles user interaction. The following structure supports easy extension and robust orchestration of the entire text-to-SQL pipeline, from natural language input to federated query response:

A terminal window shows the directory tree of a project named OLLAMA_PIPELINE_WITH_UI, displaying folders such as config, core, data, frontend, global_index_db, setup, and tasks, along with files such as requirements.txt and main.py.

Figure 15.3: Folder structure of agentic text-to-SQL solution

The end-to-end code is available in the GitHub repository.

Requirements

To enable a user-friendly interface for natural language querying, the system leverages Streamlit for the frontend UI. This allows users to input plain English questions and view results interactively in a web application. For serving backend logic as APIs (when decoupled from the UI), FastAPI and Uvicorn are used to create and host asynchronous endpoints efficiently. These components ensure seamless interaction between the user and the backend processing layers without requiring deep technical expertise.

At the core of the pipeline lies a robust agentic reasoning framework powered by LangChain, which utilizes tools and CoT prompting to deconstruct user queries and generate SQL dynamically. LangChain Community modules enhance this functionality with integrations to tools like ChromaDB and SQL connectors. The Sentence Transformers library generates high-quality embeddings for user queries and documents, which are stored and retrieved using ChromaDB, a high-speed vector database. This embedding-based semantic retrieval ensures schema-aware, contextually accurate results. Additionally, Ollama is used to run local LLMs, such as Llama or Mistral, which perform tasks like query generation, summarization, and output validation.

Finally, Trino serves as the federated SQL query engine that allows seamless execution of SQL across multiple structured data sources (e.g., PostgreSQL, MySQL). This ensures that the system can access and aggregate data from disparate databases in real-time. SQLite3 is used for lightweight local storage of structured datasets, making it ideal for prototyping or small-scale deployment. Combined with requests for API communication and lightweight execution logic, this stack forms a powerful, locally runnable text-to-SQL solution with no cloud or external service dependencies.

Setup instructions

The following list outlines the setup steps to run this project locally:

1. Clone or extract project: Use the following code to extract and navigate to the project folder:

unzip Chapter_15_Text2SQL-main.zip

cd Text2SQL-main/ollama_pipeline_with_ui

2. Create and activate a virtual environment (recommended): Create and activate a virtual environment to manage dependencies cleanly using the following code:

python -m venv venv

source venv/bin/activate # On Windows: venv\Scripts\activate

3. Install dependencies: Use the provided requirements.txt file to install all necessary dependencies:

pip install -r requirements.txt

Make sure that Ollama is installed and running locally (e.g., ollama run mistral).

4. Seed the database: This script creates or fills a local SQLite database located in the data/ directory:

python seed_sqlite_data.py

This will create or populate a local SQLite database inside the data/ folder.

5. Run the main pipeline (backend only): Run the backend pipeline directly with the following command:

python main.py

You can edit main.py to invoke run_query("your query here") if needed.

6. Run Streamlit UI (optional): If a Streamlit app is available in frontend/, run the following code:

streamlit run frontend/app.py
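Step 4's seeding can be approximated with the stdlib sqlite3 module. The table layout and rows below are illustrative assumptions; the repository's seed_sqlite_data.py defines its own schema and data:

```python
import sqlite3

def seed(path=":memory:"):
    # Create and populate a sample retail table (illustrative schema, not the repo's).
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
    conn.executemany(
        "INSERT INTO customers (name, region) VALUES (?, ?)",
        [("Asha", "South"), ("Ben", "North"), ("Chen", "East")],
    )
    conn.commit()
    return conn

conn = seed()
count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)  # 3
```

Passing a file path such as data/sqlite1.db instead of ":memory:" would persist the database the way the seeding script does.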

Understanding each Python script

This section provides a structured walkthrough of all Python source files in the agentic text-to-SQL system. Each module plays a specific role in transforming natural language queries into structured SQL responses. The architecture is modularized into agents, tasks, core logic, UI, and setup scripts to ensure flexibility, clarity, and reusability. The explanations here are aimed at readers who may be new to agent-based reasoning, vector search, or LangChain-based orchestration.

Main execution layer

The following list outlines the layer that orchestrates the system’s core logic, coordinating data seeding, query handling, and agent invocation to drive the end-to-end flow:

  • main.py: This file is the primary orchestrator of the pipeline. It coordinates the invocation of the schema summarization agent, aggregation of semantic results, SQL query generation, and LLM-based quality evaluation. It defines a function, run_query(query), that serves as the operational backbone of the system’s end-to-end workflow.
  • seed_sqlite_data.py: A utility script that populates local SQLite databases with sample customer, product, and transaction data. This is essential for initializing a testable environment and ensuring reproducibility of query executions during experimentation or demonstration.
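The control flow of run_query can be sketched with stubs standing in for the agent, aggregator, generator, and graders. Every function body below is a placeholder assumption, not the repository's actual implementation:

```python
def summarize_schema(query):           # stand-in for summarization_schema_agent.invoke
    return {"tables": ["orders"]}

def aggregate_summarized_data(query):  # stand-in for tasks/aggregator.py
    return {"final_summary": f"orders relevant to: {query}"}

def generate_sql(query, schema):       # stand-in for tasks/sql_generator.py
    return f"SELECT * FROM {schema['tables'][0]};"

def grade(text):                       # stand-in for tasks/grader.py (LLM-based in the book)
    return "PASS" if text else "FAIL"

def run_query(query):
    # Mirrors the sequence described above: summarize, aggregate, generate, grade.
    schema_results = summarize_schema(query)
    aggregated = aggregate_summarized_data(query)
    sql_query = generate_sql(query, schema_results)
    return {
        "sql": sql_query,
        "summary": aggregated["final_summary"],
        "sql_grade": grade(sql_query),
        "summary_grade": grade(aggregated["final_summary"]),
    }

result = run_query("show all orders")
print(result["sql"])  # SELECT * FROM orders;
```

The returned dictionary matches the book's description: results are logged and returned, never persisted.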

Agent modules

The following modules implement intelligent agents based on LangChain’s ReAct framework that decompose user intent, reason over schemas, and prepare task inputs:

  • agents/summarization_schema_agent.py: This module defines a LangChain ReAct-style agent responsible for schema summarization and interpreting user intent. It combines vector retrieval, prompt chaining, and tool execution to prepare the system for downstream SQL generation.
  • agents/sql_agent.py: Although not invoked directly in the main pipeline, this file defines a secondary agent capable of handling SQL-specific reasoning. It may be useful in future extensions where agent composition or fallback strategies are needed.
  • agents/__init__.py: It initializes the agents module as a Python package, enabling relative imports and modular organization.

Core infrastructure layer

Refer to the following list, which includes the foundational layers powering core services like embeddings, database access, LLM interaction, and utility logic:

  • core/embeddings.py: It implements text embedding generation using pre-trained Sentence Transformers models. It transforms user queries and schema descriptions into dense vector representations suitable for similarity-based retrieval.
  • core/chroma_index.py: It interfaces with ChromaDB, a vector database used to store and retrieve embeddings. It supports both insertion and semantic search operations, enabling schema-level understanding across distributed datasets.
  • core/llm.py: It handles interaction with local LLMs served via Ollama. It encapsulates prompt construction and response parsing for tasks such as SQL generation, grading, and summarization.
  • core/sql_executor.py: It executes SQL queries generated by the agent pipeline. It connects to local SQLite databases and is designed to support federated querying across multiple data sources.
  • core/sqlite_multi_reader.py: It provides federated access to multiple SQLite files. It supports dynamic selection and retrieval from various schema-specific tables to enable rich aggregation.
  • core/cache.py: It implements basic caching for repeated queries or embedding lookups. Although not used in the default execution path, it can improve performance in iterative or long-running deployments.
  • core/utils.py: It contains utility functions that support core operations such as schema parsing, table name extraction, or data transformation.
  • core/__init__.py: It initializes the core package and supports modular encapsulation.
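The federated-read idea behind core/sqlite_multi_reader.py can be sketched with the stdlib sqlite3 module: run the same query against several connections and concatenate the rows. The helper names, table, and data below are hypothetical, not the repository's code:

```python
import sqlite3

def make_db(rows):
    # Build an in-memory SQLite database with an illustrative sales table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def federated_query(conns, sql):
    # Execute the same SQL against every connection and merge the row lists.
    results = []
    for conn in conns:
        results.extend(conn.execute(sql).fetchall())
    return results

db1 = make_db([("shoes", 120.0)])
db2 = make_db([("shirts", 80.0), ("hats", 25.0)])
rows = federated_query([db1, db2], "SELECT product, amount FROM sales")
print(len(rows))  # 3
```

A production engine such as Trino pushes parts of the query down to each source instead of merging raw rows client-side, but the aggregation principle is the same.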

Task-oriented modules

The following components encapsulate discrete tasks like SQL generation, grading, summarization, and schema matching, often driven by LLMs:

  • tasks/aggregator.py: It combines partial results from multiple subtasks (e.g., summaries or data sources) into a unified summary. This enables holistic interpretation and better alignment with user intent.
  • tasks/sql_generator.py: It constructs SQL queries from structured task inputs such as selected table, columns, and filters. It relies on LLM prompting strategies to ensure queries are contextually and syntactically valid.
  • tasks/grader.py: It provides automated evaluation of both generated SQL and semantic summaries using LLM-based grading. It helps ensure answer quality and model interpretability.
  • tasks/schema_matcher.py: It matches the elements of the user's query with relevant schema components using embedding similarity and heuristics. This step is crucial for accurate SQL generation across distributed schemas.
  • tasks/summarizer.py: It generates concise natural language summaries of schema content or query results. It aids in making system responses human-readable and interpretable.
  • tasks/utils.py: It offers general-purpose helper functions for formatting, token manipulation, and schema validation tasks.
  • tasks/__init__.py: It initializes the tasks package.
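LLM grading aside, one inexpensive validity check is to ask SQLite to compile a query without executing it, via EXPLAIN. The helper below is an illustrative sketch under that assumption, not the repository's grader:

```python
import sqlite3

def syntactically_valid(sql, schema_ddl):
    # Have SQLite parse and plan the query against an empty schema, without running it.
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_ddl)
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

DDL = "CREATE TABLE orders (order_id INTEGER, amount REAL);"
print(syntactically_valid("SELECT order_id FROM orders WHERE amount > 10", DDL))  # True
print(syntactically_valid("SELEC order_id FRM orders", DDL))                      # False
```

Such a check catches syntax errors and references to unknown tables cheaply; the LLM grader remains responsible for the harder question of semantic alignment with the user's intent.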

Frontend interface

This module provides a user-friendly graphical interface, enabling interactive natural language querying through a Streamlit app.

frontend/app.py defines the Streamlit-based graphical interface for the system. Users can enter natural language questions, trigger the full pipeline, and view results interactively. This makes the system accessible to non-technical users.

System setup and index initialization

This section contains scripts that initialize vector indexes and prepare the system for semantic retrieval by embedding database schema information into the Chroma vector store:

  • setup/populate_chroma.py: It populates the Chroma vector store with embeddings derived from schema content and metadata. This prepares the retrieval system to respond to semantic queries.
    • setup/__init__.py: It initializes the setup package.

This modular file design reflects best practices in modern AI system development, separating concerns across user interaction, reasoning, storage, and execution. Each module is designed to be independently testable and extensible, supporting scalable deployment and iterative enhancement.

In the next section, let us understand the inner workings of the code.

Inner workings of the code

In this section, we will understand the internal structure and execution logic of an agentic text-to-SQL system implemented using LangChain, Ollama, ChromaDB, and SQLite. The project is organized into clearly modular components that represent the distinct phases of the pipeline: query understanding, schema summarization, SQL generation, grading, and result aggregation. Designed for extensibility and clarity, the system employs a structured folder hierarchy and an agent-driven orchestration layer to enable seamless translation from natural language queries into executable SQL.

  • Entry point and orchestration logic: The execution begins with main.py, which serves as the primary entry script. This file coordinates the complete flow from input query to final output. The core logic is encapsulated in the function run_query(query), which follows a well-defined sequence:

    schema_results = summarization_schema_agent.invoke({"input": query})

    aggregated_result = aggregate_summarized_data(query)

    sql_query = generate_sql(...)

    sql_grade = grade_sql(sql_query)

    summary_grade = grade_summary(aggregated_result["final_summary"])

    This orchestrator calls the summarization agent, aggregates the retrieved information, generates a SQL query, and subsequently grades both the SQL and the aggregated summary. The final output is returned as a dictionary and logged but not persisted to any file or database.

  • Agent configuration and role: The file agents/summarization_schema_agent.py defines a LangChain-based ReAct agent responsible for interpreting the query intent and interacting with schema-related tools. This agent acts as the initial interpreter and routes the request to appropriate modules for summarization and schema matching.

    The sibling file agents/sql_agent.py offers an alternative agent that may be used for deeper SQL reasoning. However, this agent is not actively invoked in the main orchestration path.

  • Task-specific functional modules: The tasks/ directory contains functionally segregated logic. Each module performs a well-scoped task, which are as follows:
    • aggregator.py: It consolidates semantic results from various summarizers.
    • sql_generator.py: It constructs SQL queries using CoT prompting, conditionals, and table metadata.
    • grader.py: It uses LLM scoring mechanisms to evaluate the quality of the SQL query and textual summary.
    • schema_matcher.py: It identifies the schema components relevant to the query.
    • summarizer.py: It produces textual summaries from a structured schema or query results.
    • utils.py: It offers auxiliary helper functions for formatting and text processing.

    The functions from these modules are all invoked through the orchestrator in main.py or indirectly via agent tools.

  • Core services and infrastructure: The core/ folder includes low-level utilities and backend infrastructure required to perform core computational tasks:
    • chroma_index.py: It manages interactions with ChromaDB for vector-based retrieval.
    • embeddings.py: It generates dense vector representations using Sentence Transformers.
    • sql_executor.py and sqlite_multi_reader.py: They provide interfaces to execute SQL over multiple SQLite instances.
    • llm.py: It abstracts interactions with locally hosted LLMs via Ollama.
    • cache.py: It is an optional component for caching repeated operations.
    • utils.py: The core utilities that assist in handling schema, connectors, and text normalization.

    These files operate primarily as backend services that are invoked by the higher-level task and agent modules.

  • Data initialization and persistence: The two modules that manage data loading and indexing are as follows:
    • seed_sqlite_data.py: Seeds SQLite databases (data/sqlite1.db, data/sqlite2.db) with sample retail data.
    • setup/populate_chroma.py: It computes and stores embeddings into ChromaDB, saved in the global_index_db/ directory.

    This initialization is vital for enabling semantic search and query execution over distributed datasets.

  • Frontend user interface: The file frontend/app.py implements a basic Streamlit application that allows users to interact with the system through a natural language input interface. It connects directly to the run_query() function from main.py and renders results, including SQL output, summary, and grades on the UI.
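To make the CoT prompting strategy used by tasks/sql_generator.py concrete, the sketch below assembles a hypothetical prompt from the structured inputs the module works with (table, columns, filters). The template wording is an assumption for illustration, not the book's actual prompt:

```python
def build_cot_prompt(question, table, columns, filters):
    # Assemble a chain-of-thought style prompt for an LLM SQL generator.
    lines = [
        "You are a SQL assistant. Think step by step:",
        "1. Identify the relevant table and columns.",
        "2. Translate the filters into a WHERE clause.",
        "3. Write the final SQL query.",
        f"Question: {question}",
        f"Table: {table}",
        f"Columns: {', '.join(columns)}",
        f"Filters: {filters}",
        "Answer with SQL only.",
    ]
    return "\n".join(lines)

prompt = build_cot_prompt(
    "Total order amount in the South region",
    "orders",
    ["order_id", "region", "amount"],
    "region = 'South'",
)
print("step by step" in prompt)  # True
```

In the real pipeline, a string like this would be passed to the local LLM via core/llm.py, and the model's response would then be graded before execution.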

To complement the modular breakdown of system components, this section presents a chronological overview of the full text-to-SQL pipeline as illustrated in Figure 15.1. Each step corresponds to a discrete processing stage, from user input and query embedding to SQL generation, grading, execution, and response delivery. The figure provides a visual abstraction of how data and control signals propagate across agents, tools, and databases within the system. This layered orchestration, driven by LangChain and supported by ChromaDB and SQLite, ensures that user queries are interpreted contextually, translated accurately into SQL, and executed efficiently. What follows is a detailed explanation of each numbered stage in the figure.

Agent and tool summary

The primary differentiator of this solution lies in its integration of a LangChain-based ReAct agent with specialized tools for schema understanding and semantic alignment. The summarization_schema_agent intelligently interprets user intent and invokes tools for schema summarization and matching, enabling robust adaptation across varied database structures. These tools ensure that the agent remains context-aware and schema-sensitive, even in heterogeneous environments. This agent-tool synergy not only reduces hallucination in SQL generation but also allows modular plug-in logic for summarization, aggregation, and evaluation, establishing the system’s core advantage in enabling precise, explainable, and scalable natural language access to relational databases.

The following pipeline employs a single active LangChain ReAct agent along with a set of registered and standalone tools to perform schema interpretation, SQL generation, and quality assessment:

  • Agent: The system contains one active LangChain ReAct agent, defined in:

    agents/summarization_schema_agent.py

    While there is a second file, agents/sql_agent.py, it is not actively used in the current execution path (main.py). Thus, only one agent is involved in the pipeline.

  • Tools: There are two agent-registered tools and three additional functional tools. The active agent (summarization_schema_agent) invokes two explicit LangChain-compatible tools:
    • Schema matcher tool: It uses logic from tasks/schema_matcher.py.
    • Schema summarizer tool: It uses logic from tasks/summarizer.py.

Additionally, outside the LangChain agent but within the pipeline, main.py directly uses the following task-level tools:

  • SQL generator from tasks/sql_generator.py
  • SQL and summary graders from tasks/grader.py
  • Aggregator from tasks/aggregator.py

Output from the text-to-SQL system

Once the query has been processed through embedding, schema matching, summarization, SQL generation, and grading, the system produces multiple structured outputs. These outputs are generated entirely in-memory and are rendered via the Streamlit interface for the end user. The outputs serve both human interpretability and machine-verifiable evaluation.

The key outputs are as follows:

  • Final aggregated summary
  • Detailed entity and database summary
  • Generated SQL query
  • SQL query grade
  • Summary grade

The final aggregated summary depicted in the following figure indicates that the output presents a comprehensive human-readable synthesis of individuals and database-level records matched semantically to the user query. It highlights individuals with unique entries, cross-database identifiers, and summarized attributes such as Age, City, and ID. Importantly, it resolves entries across databases into unified entities where applicable.

The summary lists individuals by name, age, and city, followed by two databases detailing entries with ID numbers, names, ages, and locations, all displayed as bullet points on a dark background.

Figure 15.4: Final aggregated summary from the system

Detailed entity and database summary

A deeper representation of the underlying data structure, this summary enumerates the records found in each database, highlighting ID ranges and associated individuals. It aids in verifying schema alignment and provides transparency into how data was retrieved and normalized.

The following figure shows the detailed entity and database summary generated by the text-to-SQL system’s UI, highlighting unique individuals, potential data duplicates, and their source database records:

Screenshot of a summary and data tables. The summary lists four individuals by name, age, and city, with a duplicate record for George Martinez showing a different age. Tables below display the same details.

Figure 15.5: Detailed entity and database summary

Generated SQL query

The system produces a syntactically correct SQL query that corresponds to the user’s intent. Constructed using a CoT prompting template, this query encapsulates the selected table, filtered columns, and conditions. It reflects an interpretable breakdown of reasoning steps used to form the query logic. The following figure displays the generated SQL query breakdown by the text-to-SQL system UI, outlining each step in the query construction, from intent recognition to table/column selection and filter condition formulation, leading to the final executable SQL.

Screenshot of a dark-themed text editor showing the SQL query generation process, including steps for identifying intent, matching tables, selecting columns, adding filters, and the final query retrieving id, name, age, and city from customers.

Figure 15.6: Generated SQL query from text-to-SQL system on UI

SQL query grade

To validate query quality, the system invokes a grading tool that evaluates correctness, relevance, and execution efficiency. The score is broken into components, each explained, and includes observations about potential ambiguity, indexing efficiency, or logical clarity. This grading supports explainability and query refinement.
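The component-wise scoring can be illustrated with a small helper. `aggregate_sql_grade` is a hypothetical function, a simplified stand-in for the grader in tasks/grader.py, which in the real system also attaches an LLM-written justification to each dimension:

```python
def aggregate_sql_grade(scores: dict[str, float], max_per_dim: float = 5.0) -> dict:
    """Combine per-dimension scores (each out of max_per_dim) into a total grade."""
    total = sum(scores.values())
    max_total = max_per_dim * len(scores)
    return {
        "per_dimension": scores,
        "total": total,
        "max_total": max_total,
        "percent": round(100 * total / max_total, 1),
    }

# Example mirroring the grades shown on the UI:
# correctness 5/5, relevance 4.5/5, efficiency 4/5 -> 13.5/15.
grade = aggregate_sql_grade({"correctness": 5.0, "relevance": 4.5, "efficiency": 4.0})
```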

The following figure presents the SQL Query Grade interface of the text-to-SQL system, evaluating the generated query across correctness, relevance, and efficiency dimensions, and providing a detailed justification for each score.

Screenshot showing the evaluation of a SQL query, graded on correctness (5/5), relevance (4.5/5), and efficiency (4/5), with detailed comments on each aspect and a total score of 13.5 out of 15.

Figure 15.7: SQL Query Grade from text-to-SQL system on UI

Summary grade

The summary produced earlier is also evaluated by the system for accuracy, clarity, and comprehensiveness. The grader identifies possible duplication or schema-level omissions (e.g., entry count ambiguity) and provides suggestions for enhancing textual presentation. This final score ensures the end user receives verifiable insights.

A summary grade report divided into accuracy, comprehensiveness, and clarity sections, each scored out of 10. The report includes improvement suggestions, shown as white text on a dark background with a green check icon at the top.

Figure 15.8: Summary Grade from text-to-SQL system on UI

Solution to the initial problem statement

The outputs generated by the agentic text-to-SQL system offer a direct and effective response to the challenges outlined in the initial problem statement. In traditional retail and enterprise data environments, business users often struggle to query large, siloed databases due to a lack of SQL expertise, resulting in delayed insights and missed opportunities. This system addresses that gap by allowing users to interact with distributed, heterogeneous datasets using natural language, while internally orchestrating schema alignment, semantic understanding, SQL generation, and validation.

The final aggregated summary and detailed database output enable business stakeholders to receive clear and human-readable insights drawn from multiple databases without having to understand their structure or write SQL manually. These summaries consolidate relevant data, resolve duplicate entries across databases, and surface actionable patterns (e.g., customer profiles, city-wise distributions) in a format suitable for rapid interpretation and downstream decision-making.

Moreover, the generated SQL query and its grading outputs serve two vital purposes: first, they transparently show how the system translates a user’s intent into structured database queries; second, they provide verifiable quality assessments on correctness, relevance, and efficiency, instilling trust in the automated process. The summary grade further ensures that the textual output meets standards of clarity, completeness, and factual accuracy, making the solution suitable for reporting and business use.

Collectively, these capabilities transform a manual, error-prone querying process into a fully automated, explainable, and scalable pipeline, empowering non-technical teams to access insights across databases in real time and act decisively based on context-rich, validated information.

This system is not intended to replace data engineers but rather to augment their capabilities and reduce the operational bottlenecks in querying enterprise data. By automating routine and repetitive SQL generation tasks, the platform empowers business users to retrieve insights independently, allowing data engineers to focus on higher-order activities such as data modeling, pipeline optimization, and governance. The solution democratizes access to structured data without compromising schema fidelity, execution correctness, or system auditability. In doing so, it enhances productivity across roles while preserving the critical responsibilities and oversight provided by technical data teams.

Conclusion

This chapter has provided a comprehensive walkthrough of the inner workings of an agentic text-to-SQL system, highlighting the design, logic, and output structure that underpin its functionality. Beginning with the orchestration logic in main.py, we examined how the system sequentially invokes schema summarization, aggregation, SQL generation, and quality grading using a modular, tool-driven agent pipeline. The integration of a single ReAct-style LangChain agent, equipped with schema-aware tools, forms the backbone of intelligent query interpretation and response generation.

The task-specific modules ensure clear separation of responsibilities, with each component, such as the summarizer, SQL generator, and grader, performing distinct, verifiable roles. Core infrastructure modules provide support for vector-based retrieval, LLM interaction, and multi-database SQL execution. The use of ChromaDB and SQLite in tandem enables scalable and semantically enriched querying across structured data sources.

The system’s outputs, including final summaries, graded SQL, and interpretability-focused feedback, demonstrate its usability for both technical and non-technical stakeholders. By leveraging agentic planning, CoT prompting, and local LLMs, the architecture balances transparency, adaptability, and performance. In doing so, it represents a pragmatic blueprint for deploying text-to-SQL systems in enterprise environments where precision, schema alignment, and real-time feedback are essential.

In the next chapter, we will discuss integration of optical character recognition (OCR) with generative AI (GenAI) to build intelligent pipelines that convert images into actionable search insights.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 16: GenAI for Extracting Text from Images

Introduction

In this chapter, we will explore the integration of optical character recognition (OCR) with generative AI (GenAI) to build intelligent pipelines that convert images into actionable search insights. The goal is to extract meaningful textual information from images, such as product photos, advertisements, or catalog screenshots, and use that information to guide decision-making, product discovery, or search redirection.

We begin by leveraging EasyOCR, a Python-based OCR library that provides high-accuracy text detection in images. Once text is extracted, it is passed to a lightweight large language model (LLM) hosted locally via Ollama, to generate a natural language search query. This query reflects how a human might search for similar or better alternatives on popular shopping platforms like Amazon, Flipkart, or eBay.

The pipeline then performs Uniform Resource Locator (URL) redirection to simulate searches on these platforms or fetches partial page content using lightweight scraping. The extracted snippets are summarized again using the LLM to provide users with a quick comparative overview, showcasing similar products, offers, or pricing trends.

The architecture (illustrated in Figure 16.1) is modular, interpretable, and deployable locally, making it ideal for building GenAI shopping assistants or visual product comparison tools.

Structure

In this chapter, we will learn about the following topics:

  • Three approaches to GenAI-based OCR
  • OCR on image
  • OCR on a multimodal document
  • To do

Objectives

The objective of this chapter is to equip readers with the knowledge and practical skills to perform OCR using advanced multimodal techniques. Readers will learn how to extract text from images and Portable Document Format (PDF) using foundation models capable of interpreting both visual and textual data. The chapter introduces the Mistral OCR application programming interface (API) for document understanding and highlights its integration into intelligent pipelines. Special emphasis is placed on extracting structured information from receipts with tabular data, enabling downstream analysis. By the end of the chapter, readers will be able to build robust OCR systems for diverse real-world formats and layouts.

Three approaches to GenAI-based OCR

In the context of building intelligent systems that process and understand visual input, OCR remains a foundational capability. As the demand for seamless interpretation of image-based text grows, so does the evolution of techniques to perform OCR using traditional machine learning (ML), transformer-based language models, and multimodal reasoning systems. The following section introduces and contrasts three distinct approaches to OCR in a GenAI context, namely wrapping standalone OCR engines within GenAI workflows, using LLMs natively trained to perform OCR, and employing multimodal LLMs capable of direct image-to-text comprehension:

  • Wrapping a traditional OCR tool within a GenAI pipeline: The first approach, and the primary focus of this chapter, involves integrating a high-performance OCR engine such as EasyOCR into a GenAI-enabled processing pipeline. EasyOCR is a lightweight, open-source library that uses deep learning to detect and recognize text in images. This component is responsible solely for text extraction. Once extracted, the text is passed into a local or hosted language model, typically through an API call, to generate semantically meaningful interpretations, such as search queries or summaries. This method is modular and interpretable as the OCR and LLM components are separated, allowing each to be tuned, optimized, or replaced independently. It is ideal for applications where image quality, language control, and local deployment constraints are critical.
  • OCR foundation model: The second approach leverages the capabilities of transformer-based language models that are trained end-to-end on both textual and visual data, with OCR as an embedded function. These models are designed to ingest image pixels directly and output recognized text, without relying on an external OCR engine. While such models are often proprietary (e.g., Microsoft’s TrOCR, DeepMind’s Flamingo variants, and Mistral OCR), they offer higher integration and may outperform traditional OCR methods on noisy or unstructured inputs. This end-to-end learning capability simplifies the pipeline but sacrifices modularity and flexibility. It is well-suited to scenarios that demand OCR on highly variable inputs, such as handwritten notes, scanned documents, or noisy screenshots, where conventional OCR tools may degrade.

    Note: Mistral OCR is a dedicated OCR foundation model, not just a utility wrapper or plug-in. It is designed to be a powerful base model purpose-built for OCR and complex document understanding tasks. Key details include the following:

    • Document understanding architecture: Mistral OCR is built to fully comprehend documents, including media, text, tables, and mathematical expressions, delivering accurate and structured outputs for downstream applications.
    • Standalone API and SDK: It is available via a publicly accessible Mistral-OCR-latest API, supported by an official Python SDK. It runs as a foundation model across platforms like Vertex AI or Azure AI Foundry.
    • Benchmark-leading accuracy: In internal tests, Mistral OCR outperforms Google Document AI, Azure OCR, and Gemini models, achieving 97–99% accuracy across challenging content like tables, equations, and multilingual text.
    • Enterprise-grade: It supports batch processing (up to 2,000 pages per minute on a single-node GPU), structured output, multilingual capabilities, and can be run on-premises for enterprise use.
    • Mistral provides two types of models, which are open models and premier models.
  • Fine-tuning a multimodal LLM for OCR tasks: The third approach uses a multimodal LLM like Large Language and Vision Assistant (LLaVA) or Meta Llama3.2 vision that is trained to reason jointly over visual and textual inputs. A fine-tuned multimodal model for OCR combines the visual understanding of images with specialized text recognition capabilities, optimized through targeted training on OCR-specific datasets. Unlike generic multimodal models, which treat OCR as a secondary skill, fine-tuning aligns the visual encoder and language head to accurately detect, transcribe, and interpret text in diverse layouts, fonts, and languages. This approach preserves the model’s ability to reason about surrounding visual context, such as diagrams, tables, or user interface (UI) elements, while significantly improving extraction accuracy. The result is a unified system that performs high-fidelity OCR and contextual interpretation in a single inference step, reducing pipeline complexity.

Each of these approaches reflects a different point on the spectrum of modularity, generalization, and system complexity. The approach chosen will depend on constraints such as latency, infrastructure, model availability, and interpretability requirements. In this chapter, we focus on the first method, wrapping EasyOCR within a GenAI pipeline due to its simplicity, effectiveness, and suitability for locally-deployed intelligent agents.

The following figure illustrates three distinct GenAI-based OCR integration strategies, ranging from standalone OCR foundation models to modular pipelines using traditional OCR wrapped in APIs to fine-tuned multimodal LLMs that unify OCR:

Diagram comparing three AI workflows for image and text analysis: LLM-based OCR, FastAPI-wrapped OCR combined with an LLM, and a multimodal LLM approach. Each workflow starts with an image or an image-text pair and ends with a results box.

Figure 16.1: Comparison of OCR integration approaches

Shopping assistance use case

In an era dominated by e-commerce and digital marketplaces, consumers are often faced with an overwhelming number of product choices, each accompanied by varying specifications, brands, and price points. While online platforms provide rich search interfaces, users frequently rely on images, screenshots from friends, photographs of store displays, or social media posts to express their intent, as explained in Figure 16.2. For many, the traditional approach of manually searching for each product detail is cumbersome and inefficient:

Flowchart showing a headphone image as input, OCR extracting the product text, an LLM generating a search query, retrieval from Amazon, Flipkart, and eBay, and the results being aggregated and displayed to the user.

Figure 16.2: The GenAI OCR pipeline

Consider a user who captures a screenshot of a headphone advertisement showing a brand, technical specs, and a discount. The user wants to know whether better alternatives are available within the same price range from other trusted brands like Sony or JBL. However, the text in the image cannot be copied, and searching manually is time-consuming. This is where an intelligent visual assistant becomes invaluable.

In this use case, we introduce a pipeline that combines OCR with a GenAI-powered system to automate the entire discovery process. The pipeline begins by extracting relevant product information from the image using OCR. This could include product names, specifications (e.g., 3.5mm jack, mic support, length of cable), pricing, and discounts. The extracted text is then passed to a local LLM (via Ollama), which generates a natural language query that mimics how a real user might search for alternatives online.
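The hand-off from OCR output to LLM prompt can be sketched as follows. `ocr_to_prompt` is a hypothetical helper: it assumes EasyOCR's readtext() result format of (bounding box, text, confidence) tuples, and the prompt wording is illustrative rather than the project's exact template:

```python
def ocr_to_prompt(ocr_results, min_confidence: float = 0.4) -> str:
    """Turn raw EasyOCR output into an LLM prompt for query generation.

    EasyOCR's reader.readtext() yields (bounding_box, text, confidence)
    tuples; low-confidence fragments are dropped before prompting.
    """
    fragments = [text for _, text, conf in ocr_results if conf >= min_confidence]
    extracted = " ".join(fragments)
    return (
        "The following text was extracted from a product image:\n"
        f"{extracted}\n"
        "Write a short e-commerce search query a shopper would type "
        "to find similar or better alternatives."
    )

# Sample data shaped like EasyOCR output; bounding boxes elided as None.
sample = [
    (None, "Storm Wired Headphone", 0.98),
    (None, "3.5mm jack | built-in mic | 1.5m cable", 0.91),
    (None, "##%@", 0.12),  # noisy fragment, filtered out
]
prompt = ocr_to_prompt(sample)
```

The resulting prompt string is what would be sent to the Ollama-hosted LLM to produce the user-style search query.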

Rather than just displaying the raw OCR text, the system simulates search results on multiple e-commerce platforms such as Amazon, Flipkart, and eBay. It fetches these search results or snippets of product information and summarizes them using the same LLM. The end user is then presented with a concise and contextual comparison of alternatives available in the market, without needing to open multiple websites or conduct manual research.
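The URL-construction step can be sketched with the standard library alone. This assumes the sites' public search endpoints; the exact patterns used in web_scraper.py may differ:

```python
from urllib.parse import quote_plus

def build_search_urls(query: str) -> dict[str, str]:
    """Build e-commerce search URLs for an LLM-generated query string."""
    q = quote_plus(query)  # URL-encode spaces and special characters
    return {
        "amazon": f"https://www.amazon.in/s?k={q}",
        "flipkart": f"https://www.flipkart.com/search?q={q}",
        "ebay": f"https://www.ebay.com/sch/i.html?_nkw={q}",
    }

urls = build_search_urls("wired headphones with mic under 1000")
```

Each URL can then be fetched with requests and parsed with beautifulsoup4 to collect the snippets that feed the summarization step.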

This approach significantly improves the shopping experience for users who prefer visual input, are on a budget, or are looking for smarter alternatives without investing time in repetitive searches. It is especially beneficial for price-sensitive markets and mobile-first users who often use screenshots and social media as their primary mode of capturing product interest.

Ultimately, this use case demonstrates how combining OCR with GenAI-enabled systems bridges the gap between unstructured visual input and structured, actionable insight, paving the way for intelligent, multimodal consumer tools.

OCR on image

OCR using LLMs transforms traditional text extraction from images into a semantically rich understanding task. Unlike conventional OCR, which only transcribes visible characters, LLM-based OCR can interpret layout, infer structure, and contextualize extracted content. This approach allows for intelligent extraction of headings, tables, labels, and relationships across the document. When combined with the GenAI pipeline shown in Figure 16.1, the system can return markdown or structured outputs and even answer questions about the content. This unlocks powerful capabilities for automating workflows in document analysis, digital archiving, compliance, and visual data-driven decision-making.

Figure 16.3 showcases a product listing for the Storm Wired Headphone, presenting a rich example of multimodal data where visual, textual, and semantic elements are intertwined. It contains a product photo, descriptive metadata (e.g., technical specs, user ratings, and pricing), and contextual cues such as popularity and discount details. For an OCR-enabled GenAI system, this image is not just about extracting text; it is about understanding product relevance, parsing hierarchical attributes (e.g., brand, features, price, offer), and mapping them to actionable outputs like search queries or structured records. Such multimodal inputs are ideal for pipelines that combine vision-language models (VLMs) and intelligent text extraction to enable smarter shopping assistants or product recommendation engines.

Black over-ear headphones with gray padded ear cups, an adjustable headband, and a 3.5mm audio cable. The product page shows 2,336 ratings averaging 4 stars and notes a built-in microphone and a 1.5m cable.

Figure 16.3: An example where we can run OCR

Building shopping assistance

Let us understand the folder structure of this project. The following figure outlines the modular structure of an OCR-enabled GenAI pipeline designed for visual product discovery. The workflow begins with an input image placed in the assets/ folder, which is processed using image_utils.py to extract text via EasyOCR. This raw text is converted into a search-friendly query using a local LLM via search_utils.py. Search URLs are generated by web_scraper.py and used to fetch real-time product snippets. These snippets are then summarized using summarizer.py, again leveraging the LLM. The entire pipeline is orchestrated through main.py, offering a fully local and interpretable image-to-insight system.

Project folder structure shown as a tree diagram, containing Python files and an assets folder. A note next to each file describes its purpose, including OCR logic, search query generation, web scraping, and image storage.

Figure 16.4: Code folder structure

Architecture overview

This system follows a modular architecture for processing image-based inputs and turning them into actionable shopping intelligence. The pipeline is designed to accept a product-related image, such as a photo of a retail box, a screenshot from a chat, or a promotional banner placed into the assets/ folder. From there, the image is analyzed using an OCR tool (EasyOCR) to extract visible text. The resulting raw text is passed into a local LLM via Ollama, which generates a realistic search query a user might type on Flipkart or Amazon. That query is used to construct real e-commerce search URLs. Finally, product listings from these URLs are scraped and summarized to provide an overview of similar or better alternatives. The pipeline runs entirely locally, making it useful for privacy-preserving or offline scenarios.

The architecture is modular and composed of five key components, which are as follows:

  • Image processing (image_utils.py): It extracts raw text from product images using EasyOCR.
  • 查询生成(search_utils.py):它将 OCR 文本发送到本地 LLM(通过 Ollama)以生成用户风格的搜索查询。
  • Query generation (search_utils.py): It sends the OCR text to a local LLM (via Ollama) to generate a user-style search query.
  • 搜索重定向(web_scraper.py):它使用查询构建亚马逊、Flipkart 和 eBay 的产品搜索 URL。
  • Search redirect (web_scraper.py): It constructs product search URLs for Amazon, Flipkart, and eBay using the query.
  • Web 摘要(summarizer.py):它从这些 URL 中抓取片段,并使用 LLM 总结产品趋势。
  • Web summary (summarizer.py): It scrapes the snippets from those URLs and summarizes the product trends using the LLM.
  • 编排(main.py):它负责编排整个管道并将结果打印给用户。
  • Orchestration (main.py): It orchestrates the full pipeline and prints results to the user.

The end-to-end code can be found in the GitHub repository.

The pipeline depends on a few critical Python libraries, as outlined in requirements.txt. First, easyocr (along with torch and torchvision) powers the text extraction from images. pillow supports image loading and preprocessing, if needed. The ollama package is the interface to locally hosted LLMs like Llama 3, enabling you to generate search queries and summaries without relying on cloud APIs. Web access is managed by requests and beautifulsoup4, which are used for lightweight scraping of product listings. Optional packages like selenium and google-search-results are listed but not actively used in this version, providing room for future expansion with dynamic scraping or SerpAPI-based Google search integration. Overall, the requirements are minimal and keep the system portable and offline-friendly, as shown in the following figure:

A list of Python packages with version numbers and brief notes, including easyocr, ollama, pillow, requests, beautifulsoup4, selenium, torch, torchvision, and google-search-results.

Figure 16.5: Requirement.txt snapshot, which can be run before running the entire code

The following section explains the overall flow of the solution. It begins by extracting text from an image using OCR, converts that text into a natural language search query via an LLM, fetches product listings from e-commerce platforms, and finally summarizes the results using another LLM call. The design emphasizes clarity, traceability, and graceful error handling, making it robust for real-world use.

1. OCR with EasyOCR—extracting text from images: The first major step in the pipeline involves using EasyOCR to read and extract textual information from an image. This logic is implemented in image_utils.py, where a pre-trained OCR model is initialized with English language support and configured for CPU-based inference. The core function extract_text_from_image(image_path) reads the image and returns a joined string of recognized words. For example, if the image says boAt Wired Earphones 499, the OCR engine will return that as plain text. This step is crucial because it translates unstructured visual data into a structured format that downstream components (like the LLM) can understand and reason over.

a. The image is processed using EasyOCR:

reader = easyocr.Reader(['en'], gpu=False)

results = reader.readtext(image_path, detail=0)

This extracts plain text strings from an image. For example, an image of a product box might return "JBL Wired Headphones with Mic 799".
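
Based on the snippets above, the helper in image_utils.py can be sketched as follows. The lazy reader initialization and the join_ocr_results helper are illustrative choices, not the book's exact code; the EasyOCR calls mirror the lines shown earlier:

```python
# image_utils.py — a minimal sketch of the OCR helper described above.
_reader = None

def _get_reader():
    """Create the English, CPU-only EasyOCR reader on first use (it is slow to build)."""
    global _reader
    if _reader is None:
        import easyocr  # imported lazily so the module loads without EasyOCR installed
        _reader = easyocr.Reader(['en'], gpu=False)
    return _reader

def join_ocr_results(results):
    """Join the recognized strings into one space-separated line, dropping empties."""
    return " ".join(r.strip() for r in results if r and r.strip())

def extract_text_from_image(image_path):
    """Return all text EasyOCR can read from the image as a single string."""
    results = _get_reader().readtext(image_path, detail=0)
    return join_ocr_results(results)
```

Separating the joining logic keeps the OCR dependency out of the pure string handling, which also makes that part easy to test.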

2. Query generation via LLM—converting text into intent: Once the text has been extracted from the image, it is passed to the local LLM hosted via Ollama. The function generate_search_query(ocr_text) in search_utils.py constructs a prompt asking the model to convert the OCR text into a realistic, user-friendly search phrase, something you would type into Flipkart to discover similar or better products. For example, if the extracted text is boAt 3.5mm Wired Headphones 799, the LLM might return wired headphones with a mic under 800. This step bridges the gap between raw image content and search-ready intent. It is a simple but powerful example of how LLMs can interpret ambiguous input and contextualize it for specific tasks.

a. The extracted text is passed into a prompt for the LLM:

prompt = f"The following product text was extracted from an image:\n\n{ocr_text}..."

response = ollama.chat(model="llama3.2:3b-instruct-fp16", messages=[{"role": "user", "content": prompt}])

b. The LLM returns a simplified search phrase like:

wired headphones with mic under 800
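
The query-generation helper described above can be sketched as follows. The prompt wording and the response parsing are assumptions; the ollama.chat call mirrors the snippet shown earlier:

```python
# search_utils.py — a sketch of the LLM-backed query generation step.
def build_query_prompt(ocr_text):
    """Build the instruction sent to the local LLM (wording is illustrative)."""
    return (
        "The following product text was extracted from an image:\n\n"
        f"{ocr_text}\n\n"
        "Rewrite it as a short, realistic search query a shopper would type "
        "on Flipkart or Amazon. Return only the query."
    )

def generate_search_query(ocr_text):
    """Ask the local Llama model (via Ollama) for a user-style search phrase."""
    import ollama  # requires a running Ollama server with the model pulled
    response = ollama.chat(
        model="llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": build_query_prompt(ocr_text)}],
    )
    return response["message"]["content"].strip()
```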

3. URL construction—redirect-only shopping links: Instead of performing API-based product retrieval, the web_scraper.py module builds direct search URLs for major platforms like Amazon, Flipkart, and eBay. This is achieved through simple string encoding using urllib.parse.quote_plus and dynamic URL templating. The function get_product_listings(query) takes the generated search query and inserts it into the appropriate search URL structure for each platform. For example, a query like wireless earbuds under 1000 will become https://www.amazon.in/s?k=wireless+earbuds+under+1000. This design choice allows the pipeline to be API-independent, robust to platform changes, and fast to deploy.

a. Instead of calling APIs, your system constructs direct search URLs:

i. f"https://www.amazon.in/s?k={encoded_query}"

ii. f"https://www.flipkart.com/search?q={encoded_query}"

iii. f"https://www.ebay.com/sch/i.html?_nkw={encoded_query}"

  This allows redirection to real product listings based on the generated query.
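
Under the redirect-only design described above, the URL builder in web_scraper.py might look like this. The placeholder name and price fields are assumptions, added so the dictionaries carry the keys that main.py prints later:

```python
# web_scraper.py — a sketch of the redirect-URL builder described above.
from urllib.parse import quote_plus

def get_product_listings(query):
    """Turn a search query into direct search URLs for three storefronts."""
    encoded_query = quote_plus(query)
    templates = {
        "Amazon": f"https://www.amazon.in/s?k={encoded_query}",
        "Flipkart": f"https://www.flipkart.com/search?q={encoded_query}",
        "eBay": f"https://www.ebay.com/sch/i.html?_nkw={encoded_query}",
    }
    # No API calls: each entry is just a redirect link plus placeholder fields.
    return [
        {"name": f"Search results for '{query}'", "price": "see listing",
         "merchant": store, "link": url}
        for store, url in templates.items()
    ]
```

Because quote_plus handles the encoding, a query like "wireless earbuds under 1000" becomes the Amazon URL shown in the text.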

4. Web snippet extraction and summarization: With the search URLs ready, the system fetches the HTML content of each product listing page using requests. In summarizer.py, the function fetch_page_snippet(url) scans the page and collects readable product-related snippets (e.g., titles, descriptions, prices) from common HTML tags like <a>, <div>, and <span>. These snippets are then summarized by the LLM using a second prompt that asks the model to extract themes, keywords, and pricing patterns. The function summarize_product_pages(product_listings) loops over all search results, fetches snippets from each, and returns a set of human-readable summaries, one for each store. This step elevates the user experience by providing a synthesized overview rather than dumping raw text.

a. Basic text snippets are fetched from each site using requests + BeautifulSoup:

soup = BeautifulSoup(response.text, 'html.parser')

for tag in soup.find_all(['a', 'div', 'span'], limit=100):

snippets.append(tag.get_text(strip=True))

b. Then, the snippets are summarized by LLM:

prompt = f"The following are some product listings from {site_name}:\n\n{joined_text}"

response = ollama.chat(...)

c. This returns summaries like: Common listings include boAt, JBL, and Sony under ₹1,000 with mic and tangle-free cables.

5. Full pipeline orchestration via main.py: The main.py script acts as the entry point and orchestrator for the entire system. It first scans the assets/ folder to find the first available image. This image is processed by extract_text_from_image(), the resulting text is transformed into a search query by generate_search_query(), and then the query is passed into get_product_listings() to generate shopping links. Finally, summarize_product_pages() is called to fetch, parse, and summarize the product data. Logging is used throughout the script to track progress and errors, making the system easy to debug and maintain. When executed, the script prints out both the raw listings and LLM-generated summaries, offering the user insight into what similar products are available online.

The following Python script defines a modular pipeline that automates product search and summarization based on visual input:

1. Imports and setup: It loads all modular components like OCR, query generation, web scraping, and summarization utilities:

import os

import logging

from image_utils import extract_text_from_image

from search_utils import generate_search_query

from web_scraper import get_product_listings

from summarizer import summarize_product_pages

This section imports the necessary modules and functions. Each module is responsible for a specific task:

a. image_utils.py: It contains OCR logic.

b. search_utils.py: It contains LLM-based query generation.

c. web_scraper.py: It constructs e-commerce search URLs.

d. summarizer.py: It fetches content from those URLs and summarizes it.

The system is built in a modular fashion for easy maintenance and scalability.

2. Logging configuration: It sets up formatted logging to aid debugging and monitor execution flow with time-stamped messages:

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

Logging is configured to output informational messages and errors in a time-stamped, structured format. This is useful for debugging, especially if OCR fails, the image path is incorrect, or the web response fails.

3. Image finder utility: It locates the first valid image in the assets/ directory to serve as the OCR input source:

def find_first_image_in_assets():

assets_folder = "assets"

if not os.path.exists(assets_folder):

raise FileNotFoundError(f"Assets folder '{assets_folder}' not found")

for file in os.listdir(assets_folder):

if file.lower().endswith(('.jpg', '.jpeg', '.png', '.webp')):

return os.path.join(assets_folder, file)

raise FileNotFoundError("No image found in the assets folder.")

This helper function looks inside the assets/ directory and returns the first valid image file it finds. If the folder does not exist or contains no supported images, it raises an error. This ensures that the pipeline always has a visual input to begin with.

4. Main pipeline execution: It begins the end-to-end flow by invoking the image finder and initiating subsequent processing steps:

def main():

try:

image_path = find_first_image_in_assets()

This starts the workflow by calling the image-finding utility. The file path is stored in image_path, which will be used in the OCR step.

5. OCR text extraction: Uses EasyOCR to extract text from the located image; fails gracefully if no text is found:

extracted_text = extract_text_from_image(image_path)

if not extracted_text:

raise ValueError("No text could be extracted from the image")

logging.info(f"OCR Extracted Text:\n{extracted_text}")

Here, EasyOCR reads the image and returns a string of recognized text. If no text is detected (empty string), an exception is raised. The result is logged for traceability.

6. Search query generation via LLM: Converts extracted text into a clean, user-like search query using a local language model and sanitizes it:

query = generate_search_query(extracted_text)

if not query:

raise ValueError("Failed to generate a valid search query")

query = query.replace('"', '').replace("₹", "rs").replace("or less", "under").replace("alternative", "")

logging.info(f"Search Query:\n{query}")

a. The raw OCR text is now passed to a local LLM (via Ollama), which transforms it into a user-like search query such as:

"wired headphones under rs 800"

Some basic sanitization is done to strip special characters and standardize currency symbols (₹ → rs) for compatibility with search URLs.
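
Wrapped in a helper, the sanitization step looks like this. The function name sanitize_query is illustrative; main.py applies the same replace chain inline, and the trailing strip() is an added assumption:

```python
# A sketch of the query sanitization described above.
def sanitize_query(query):
    """Strip quotes, normalize currency and phrasing for URL-friendly queries."""
    return (query.replace('"', '')
                 .replace("₹", "rs")
                 .replace("or less", "under")
                 .replace("alternative", "")
                 .strip())
```

For example, sanitize_query('"wired headphones ₹800 or less"') yields "wired headphones rs800 under".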

7. Redirect URL construction: Builds e-commerce search result URLs from the generated query for platforms like Amazon and Flipkart:

results = get_product_listings(query)

if not results:

logging.warning("No product listings found")

return

This query is used to build search URLs for Amazon, Flipkart, and eBay by calling get_product_listings(query). These are not actual API calls but direct redirect URLs to the respective websites. If none are returned (which should not happen), a warning is logged.

8. Print URLs to console: Displays a concise list of found products, including name, price, store, and a clickable link:

print("\nHere are some similar or better alternatives you can check out:\n")

for i, res in enumerate(results, 1):

print(f"{i}. {res['name']}")

print(f" Price: {res['price']}")

print(f" Store: {res['merchant']}")

print(f" Link: {res['link']}\n")

The generated search listings are printed in a clean format. Since these are redirect URLs (not full product listings), each entry just shows:

a. Store name

b. Link to the search result page

9. Summarize web snippets: It fetches and summarizes content from the search result pages using LLM to highlight trends and insights:

print("\nSummary of Product Listings:\n")

summaries = summarize_product_pages(results)

for s in summaries:

print(f"{s['store']} Summary:\n{s['summary']}\n")

This is where the second LLM call happens. For each search URL, the program does the following:

a. Fetches the web page using requests.

b. Extracts visible product text from the HTML.

c. Summarizes the overall trend using the LLM (e.g., top brands, typical price ranges, common features).

d. This provides a readable overview of what is trending in each store's results based on your query.

10. Robust error handling: Captures and logs file, value, and unexpected errors to ensure graceful failure and meaningful logs:

except FileNotFoundError as e:

logging.error(f"File error: {str(e)}")

except ValueError as e:

logging.error(f"Value error: {str(e)}")

except Exception as e:

logging.error(f"An unexpected error occurred: {str(e)}")

a. The three types of exceptions caught are as follows:

i. Missing folder or image.

ii. Empty OCR or invalid LLM output.

iii. Any other unexpected errors.

This ensures the pipeline fails gracefully and outputs helpful logs.

b. Execution trigger:

if __name__ == "__main__":

main()

This is the script’s execution entry point. It ensures the main() function only runs when the script is directly executed, not when imported as a module.

Understanding the output

This output represents the final result of a complete OCR-to-LLM pipeline, where a user-provided image of a product is used to generate intelligent e-commerce alternatives. The image in question shows a wired headphone advertisement with visible price and features. The following figure provides a breakdown of the system's behavior and its resulting output.

Starting with a product image containing details of a wired headphone, the system accurately extracts descriptive text using EasyOCR. This raw text is then transformed by a local LLM into a realistic and goal-oriented search query, specifically looking for alternatives from well-known brands like Sony and JBL within a set price range. Using this query, the system constructs direct search URLs for Amazon, Flipkart, and eBay, mimicking how a human might explore online stores. It then retrieves visible text snippets from those search pages and summarizes the results using the same LLM. In the case of Flipkart, the model identifies that the page content lacks substantive product information and instead focuses on promotional language and urgency cues, such as limited-time offers and fast delivery messaging. This response not only highlights the model’s ability to extract and interpret data but also to assess the quality and relevance of content across different platforms, ultimately empowering users to make informed shopping decisions based on visual inputs alone.

A screenshot showing an analysis of Sony and JBL wired headphone product listings, highlighting limited-time offers, faster delivery, seamless order tracking, and a lack of specific detail in the product descriptions.

Figure 16.6: Output from the GenAI system

OCR on a multimodal document

OCR on multimodal documents like PDFs involves extracting and interpreting a mix of textual, visual, and structural content within a single file. Unlike plain images, PDFs often include typed text, scanned pages, tables, images, and layout elements such as headers, footers, and multi-column sections. Advanced OCR systems powered by VLMs or foundation models like Mistral OCR can process these documents holistically, identifying reading order, extracting tables and figures, preserving formatting, and capturing semantic meaning. When integrated with schema-based extraction or document question and answer (QA) capabilities, this enables automated understanding of contracts, invoices, reports, or academic papers, making PDF-based workflows intelligent, searchable, and machine-actionable.

The following image-text snippet illustrates how visual (chart + table) and textual data together convey the progression of educational attainment from 1996 to 2022:

A stacked bar chart showing educational attainment in South Africa (ages 20 and above) for 1996, 2001, 2011, and 2022, with the share of higher and secondary education increasing over time while the share with no schooling declines.

Figure 16.7: An example of a multimodal document suitable for OCR processing

Figure 16.7 exemplifies a multimodal data source that combines structured visuals (bar charts and tables) with unstructured textual descriptions, representing a rich and complex input ideal for OCR-driven document understanding. In the context of multimodal AI systems, such images require not only text extraction but also layout interpretation, semantic alignment between graphical and textual elements, and contextual reasoning. Leveraging advanced OCR techniques powered by LLMs enables accurate transcription, structure preservation, and meaningful interpretation, transforming static visual content into actionable, machine-readable insights. This forms a critical foundation for intelligent analysis of reports, educational trends, and policy documents.

Mistral's OCR

Mistral’s OCR stack transforms traditional OCR from passive transcription into an active, structured, and interactive system. It offers layered capabilities: layout-preserving text extraction, schema-based data capture, and context-aware querying via LLMs. These functions—Basic OCR, annotations, and document QA—enable developers to build sophisticated document intelligence applications, from extracting tables and captioning figures to creating chat-like assistants. Together, they form a versatile foundation for real-world GenAI pipelines, supporting rich multimodal workflows across diverse document types, as follows:

  • Basic OCR: Mistral’s basic OCR is a fully featured recognition system that does more than just transcribe text. It extracts text along with structural information, like headers, paragraphs, lists, and tables, and returns results in Markdown format, enabling seamless integration into documentation or pipelines. The model preserves document layout and hierarchy, handles multi-column and complex designs, and works with image and PDF inputs (via URLs or base64). It also provides bounding boxes and metadata, allowing developers to precisely locate each piece of text in the original image, crucial for tasks like annotation or document reconstruction.
  • Annotations: Building on Basic OCR, the annotations feature enables structured, schema-driven extraction of targeted information from documents. There are two types, which are as follows:
    • bbox_annotation: It allows users to specify bounding boxes (e.g., areas containing charts or figures) and receive captions or descriptions tailored to those regions.
    • document_annotation: It extracts structured data from the entire document into a JavaScript Object Notation (JSON) format that aligns with developer-defined schemas. This is highly useful for automating data entry from forms, invoices, or legal documents, transforming free-form scanned input into directly usable structured datasets.
  • Document QA: It elevates OCR-based pipelines by integrating context-aware LLMs. After OCR processes and extracts structured textual content along with layout (including headings, paragraphs, tables), the system can answer natural language questions about the document. For example, a user could ask, what is the total amount due? or list all the procedures outlined, and the LLM responds using its understanding of the text, structure, and relationships. This functionality enables document understanding beyond extraction, supporting analysis, summarization, or multi-document comparisons. Use cases include form processing, legal review, and academic analysis. Let us understand the code implementation:

    # Step 1: Install the Mistral client
    !pip install mistralai --quiet

    # Step 2: Import required modules
    import os
    from mistralai import Mistral

    # Step 3: Set the API key and create the client (ensure your key is securely set)
    os.environ["MISTRAL_API_KEY"] = "your_mistral_api_key"  # Replace with your actual key
    api_key = os.environ["MISTRAL_API_KEY"]
    client = Mistral(api_key=api_key)

    # Step 4: Upload the document and get a signed URL
    with open("/content/sample_data/educational_attainment_figure.pdf", "rb") as f:
        uploaded_file = client.files.upload(
            file={"file_name": "educational_attainment_figure.pdf", "content": f},
            purpose="ocr"
        )
    signed_url = client.files.get_signed_url(file_id=uploaded_file.id)

    # Step 5: Ask a question about the document
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize post-school education growth over the years."},
                {"type": "document_url", "document_url": signed_url.url}
            ]
        }
    ]
    response = client.chat.complete(
        model="mistral-small-latest",
        messages=messages
    )
    print(response.choices[0].message.content)

The regex in context

The purpose of the following regex code is to automatically detect and extract URLs from a user's input message, specifically targeting document links such as PDFs.

Mistral's document QA feature allows you to attach document URLs alongside your text prompt. But users might type something like: Can you summarize this paper? https://arxiv.org/pdf/2410.07073

To make this work, the system needs to:

  1. Identify that a URL is present.
  2. Parse it out of the full user message.
  3. Attach it as a {"type": "document_url", "document_url": ...} entry to the message payload.

    import re  # required here to support the regex-based URL extraction

    def extract_urls(text: str) -> list:

    url_pattern = r'\b((?:https?|ftp)://(?:www\.)?[^\s/$.?#].[^\s]*)\b'

    urls = re.findall(url_pattern, text)

    return urls
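
As a quick check, the helper behaves as follows on the earlier example message (the function is repeated here so the snippet runs on its own):

```python
import re

def extract_urls(text: str) -> list:
    """Find http(s)/ftp URLs embedded in free-form user text."""
    url_pattern = r'\b((?:https?|ftp)://(?:www\.)?[^\s/$.?#].[^\s]*)\b'
    return re.findall(url_pattern, text)

message = "Can you summarize this paper? https://arxiv.org/pdf/2410.07073"
print(extract_urls(message))  # ['https://arxiv.org/pdf/2410.07073']
```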

  4. Once URLs are extracted:

    user_message_content = [{"type": "text", "text": user_input}]

    for url in document_urls:

    user_message_content.append({"type": "document_url", "document_url": url})

This means Mistral's API receives the following:

  1. Your question (as text)
  2. Your PDF or document URL (as a separately recognized input)

OCR in receipt data

The following figure represents a typical example of a semi-structured receipt commonly found in datasets like the Consolidated Receipt Dataset (CORD). These receipts contain rich textual information, including itemized product listings, quantities, unit prices, tax calculations, and total payment summaries, all formatted in visually complex layouts. Extracting structured information from such documents is a foundational task in modern document understanding research. This image serves as a real-world benchmark to evaluate OCR and document parsing systems, particularly for key-value extraction and table detection using foundation models like Mistral OCR or multimodal models such as Llama 3.2 vision via Ollama.

A restaurant receipt printed on white paper, listing the ordered items, quantities, and prices in Indonesian. The total at the bottom, highlighted in light blue, reads 1,565,938. The receipt rests on a wooden surface.

Figure 16.8: A receipt that consists of textual tabular data

The following Python code demonstrates the way to perform image-based document understanding using Meta's Llama 3.2 vision model via the Ollama runtime. This approach integrates computer vision and natural language understanding by allowing a user to upload an image and query it in natural language, with the large multimodal model producing a structured response. The code is designed for use in environments like Google Colab, where the image file is stored in a default data directory.

The core logic of the pipeline involves invoking the ollama.chat() method, where the model parameter is set to llama3.2-vision, indicating that a vision-enabled Llama 3.2 instance is being used. The prompt get all the data from the image is sent as the message content under the user role, and the image itself is passed in a list under the images key. Once the LLM processes the image, it returns a structured textual response within the message['content'] field of the response object. The strip() function ensures that any leading or trailing whitespace is removed from the response before displaying it. The model output in this case includes detailed invoice metadata such as company name, address, billing recipient, and line-item entries, showcasing the model’s ability to parse layout-rich documents like invoices. This example illustrates a significant advancement over traditional OCR by capturing not just text but also context, relationships, and hierarchies directly from image input, thus facilitating more intelligent document automation use cases.

  • Visual data extraction via Ollama Llama 3.2 vision:

    import ollama

    image_path = "/content/sample_data/invoice_sample.jpg"  # Replace with your image path

This line sets the path of the image to be processed. In a typical Colab setup, this would be /content/sample_data/invoice_sample.jpg.


    response = ollama.chat(

        model="llama3.2-vision",

        messages=[{

            "role": "user",

            "content": "get all the data from the image",

            "images": [image_path]

        }],

    )

This block uses Ollama’s API to interact with the Llama 3.2 vision model. The model processes the image and returns a textual breakdown of its content, ideally a structured summary of the receipt.

cleaned_text = response['message']['content'].strip()

The response is cleaned of whitespace to prepare it for further processing.

  • Prompt-based structure extraction using LangChain:

    from langchain_ollama import ChatOllama

    from langchain_core.prompts import ChatPromptTemplate

    from langchain_core.output_parsers import StrOutputParser

This portion sets up the LangChain ecosystem for prompt chaining. The user defines a template prompt instructing the model to extract and return specific fields (e.g., company name, receipt number, item list, total).
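The template prompt itself is not shown above, so here is a plain-string sketch of what it might look like. The field list and wording are assumptions based on the description, not the book's verbatim template; ChatPromptTemplate.from_template would wrap this same string in the LangChain version:

```python
# Hypothetical extraction prompt; field names are illustrative assumptions.
EXTRACTION_PROMPT = (
    "You are given raw OCR text from a receipt.\n"
    "Extract the company name, receipt number, item list, and total, and "
    "return them as a JSON object inside a triple-backtick code block.\n\n"
    "Receipt text:\n{response}"
)

# The {response} placeholder is filled with the cleaned OCR text.
filled = EXTRACTION_PROMPT.format(response="SAMPLE RECEIPT TEXT")
```

Asking explicitly for a fenced JSON block is what makes the regex-based extraction in the next step reliable.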

llm = ChatOllama(model="llama3", temperature=0)

This initializes a LLM connection to the standard Llama 3 model using Ollama (text-only model now).

chain = (prompt | llm | StrOutputParser())

return chain.invoke({"response": response})

Here, the prompt is chained with the model and parser, then executed with the cleaned OCR text. The output is expected to be JSON-formatted.

  • Extract JSON using Regex and parse it:

json_match = re.search(r"```\n(.*?)\n```", result, re.DOTALL)

This searches for a JSON block enclosed in triple backticks ``` inside the model response.
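A small worked example of this extraction step, using a hypothetical model response (the response text and its JSON content are invented for illustration):

```python
import json
import re

fence = "`" * 3  # triple backticks, assembled here so the snippet stays readable

# Hypothetical model response wrapping its JSON answer in a code block.
result = f"Here is the extracted data:\n{fence}\n{{\"Total\": 1565938}}\n{fence}\nDone."

# Pull out everything between the opening and closing fences.
json_match = re.search(r"```\n(.*?)\n```", result, re.DOTALL)
receipt_data = json_match.group(1)

# Parse the raw JSON string into a Python dictionary.
parsed_data = json.loads(receipt_data)
```

re.DOTALL lets `.` match newlines, so multi-line JSON bodies are captured by the non-greedy group.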

parsed_data = json.loads(receipt_data)

Once extracted, the JSON string is parsed into a Python dictionary using json.loads.

  • Convert to DataFrame for analysis:

    receipt_dict = json.loads(json_data)

    items_df = pd.DataFrame(receipt_dict['Items'])

The receipt dictionary is further processed by converting the Items list into a Pandas DataFrame, which enables further operations like data analysis, aggregation, or visualization.
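A toy end-to-end run of this last step follows; the receipt fields here are invented for illustration, not taken from the CORD sample:

```python
import json
import pandas as pd

# Hypothetical JSON produced by the extraction chain.
json_data = json.dumps({
    "Company": "Sample Store",
    "Items": [
        {"name": "Nasi Goreng", "qty": 2, "price": 50000},
        {"name": "Es Teh", "qty": 1, "price": 10000},
    ],
})

receipt_dict = json.loads(json_data)
items_df = pd.DataFrame(receipt_dict["Items"])

# Simple downstream analysis: per-line totals and a grand total.
items_df["line_total"] = items_df["qty"] * items_df["price"]
grand_total = int(items_df["line_total"].sum())
```

Once the items sit in a DataFrame, the usual Pandas operations (grouping, filtering, plotting) apply directly.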

This code exemplifies a multimodal RAG-like system, combining image understanding (Llama 3.2 vision), prompt-based semantic extraction (LangChain), and structured output (JSON | DataFrame). It is a compelling example of how foundational models can bridge unstructured visual inputs and structured analytics in an automated, end-to-end pipeline.

To do

As an extension to this chapter, readers are encouraged to explore real-world OCR challenges using CORD, a publicly available dataset curated for information extraction from store receipts. This dataset consists of image-PDFs and corresponding JSON annotations, making it an ideal candidate for testing document understanding systems on semi-structured financial documents. Readers can experiment with extracting merchant names, itemized purchases, totals, and tax values, either by training their own token classifiers or by using layout-aware prompting strategies. The key task is to go beyond raw text extraction and develop end-to-end pipelines that understand document semantics and formatting.

For implementation, readers may choose one of two cutting-edge approaches. First, they can leverage Mistral’s document QA API, which automatically applies OCR and allows for structured QA using document URLs. This approach is scalable and requires minimal setup. Alternatively, readers can experiment with Meta’s Llama 3.2 vision model using the Ollama runtime, which supports multimodal image inputs. In this setup, receipts can be passed as images to the model with tailored prompts (e.g., list all the items and their prices from this receipt), enabling visual-semantic reasoning. This task encourages students to combine dataset engineering, prompt design, and multimodal LLMs to create robust, high-accuracy document understanding systems.

Conclusion

In this chapter, we explored three distinct yet complementary approaches to performing OCR in the context of multimodal data. First, we demonstrated how traditional OCR tools like EasyOCR can be wrapped within a GenAI pipeline to extract and reason over text from images, enabling intelligent interpretation of unstructured visual inputs. Second, we introduced Mistral OCR, a foundation model natively trained for document understanding, which streamlines OCR on complex PDFs by providing structured outputs through API-driven document QA. Lastly, we examined the power of multimodal LLMs, such as Meta’s Llama vision series, in handling receipt images with embedded tabular data, highlighting their ability to simultaneously interpret layout, extract content, and generate semantically structured outputs. Together, these methods provide a robust toolkit for building next-generation OCR systems that bridge the gap between raw visual input and actionable structured understanding.

In the next chapter, we will focus on wrapping traditional models with GenAI, e.g., recommendation engines.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

CHAPTER 17Integrating Traditional AI/ML into GenAI Workflow

Introduction

As the boundaries between traditional machine learning (ML) and generative AI (GenAI) continue to blur, there is increasing value in creating hybrid systems that combine the strengths of both. In this chapter, we explore how to wrap and integrate conventional ML models, such as classifiers, regressors, and clustering algorithms, into GenAI agent workflows. By making these models callable as tools within agentic reasoning loops, we unlock powerful capabilities where generative agents can not only converse and generate but also predict, classify, and recommend with precision.

Using technologies like scikit-learn, LangChain, and lightweight Python microservices, you will learn how to expose ML models via application programming interfaces (APIs) and make them interact seamlessly with GenAI agents. We will walk through practical implementations, including a recommendation engine integrated as a callable tool within a large language model (LLM) reasoning chain. Along the way, we will address key operational challenges such as API latency, error handling, and versioning of models, ensuring robustness and reliability in production-ready systems.

By the end of this chapter, you will have built a fully functioning hybrid system where GenAI agents dynamically invoke ML predictions as part of their chain of thought (CoT). This fusion of reasoning and prediction paves the way for intelligent systems that are not only conversationally fluent but also analytically powerful.

Structure

In this chapter, we will learn about the following topics:

  • Case study
  • Integrating the traditional model within GenAI
  • Use case
  • Wrapping XGBoost model into LLM
  • Comparative overview of ML model integration in GenAI workflows
  • To do

Objectives

The objective of this chapter is to guide readers through the end-to-end development of a hybrid AI system that integrates traditional ML with modern GenAI. Specifically, it demonstrates how to train, deploy, and wrap an Extreme Gradient Boosting (XGBoost) fraud detection model as an API, and then use an LLM like Mistral to interface with it via natural language. Readers will learn how to extract structured features from text, call ML tools programmatically, interpret model outputs, and generate actionable explanations, all within a modular, production-ready architecture. The goal is to make traditional ML models accessible, explainable, and usable through GenAI agents.

Case study

Company X Analytics is a 15-member AI startup specializing in intelligent retail solutions for mid-sized e-commerce platforms. Over the past two years, the team developed a suite of traditional ML models, including a collaborative-filtering-based recommendation engine, a churn prediction model using XGBoost, and a product categorization model trained on custom logistic regression classifiers. These models had been manually integrated into dashboards or batch jobs, but lacked real-time, interactive utility.

As GenAI gained momentum, the company subscribed to commercial LLMs like OpenAI's Generative Pre-trained Transformer (GPT) and Anthropic’s Claude, intending to build a conversational assistant that could help retail managers make smarter, faster decisions. However, the challenge was clear, which was how to bridge the intelligence embedded in their traditional ML models with the reasoning and language fluency of LLMs.

The following case study illustrates how Company X Analytics bridged traditional ML and GenAI by wrapping legacy models into a GenAI workflow. By combining LangChain agents, RESTful ML microservices, and CoT prompting, the company built an intelligent system capable of answering complex business questions with both reasoning and predictive accuracy.

  • Problem statement: The company needed to build a hybrid GenAI agent capable of answering nuanced questions like the following:
    • Why is customer retention dropping for Segment B?
    • Which products should I recommend next week to low-engagement users?

    These queries required both natural language understanding and direct access to existing ML insights, something that LLMs could not do out of the box.

  • Solution: The engineering team adopted LangChain to create tool-augmented agents that could call traditional ML models as part of their reasoning process. Each ML model was wrapped as a microservice using FastAPI, exposing RESTful endpoints like /predict_churn, /get_recommendations, and /classify_product. These endpoints accepted structured payloads and returned JavaScript Object Notation (JSON) responses.

    The team then defined LangChain tool objects that mapped directly to these APIs. With CoT prompting, the LLM was instructed to invoke the right tool based on the user's intent. For example, if a manager asked, what are some high-risk customers this month?, the agent would parse the input, call the churn prediction API, and return actionable insights in fluent language.

  • Outcome: In under eight weeks, Company X Analytics successfully transitioned from siloed ML workflows to a unified GenAI-powered system. Their internal tool, RetailGenie, became an intelligent co-pilot for business analysts, blending conversational reasoning with predictive intelligence. The result was faster decision-making, better utilization of legacy models, and a more modern, AI-native user experience (UX), without the need to re-train or discard their existing ML stack.

While this case study is fictional, it reflects a real scenario faced by many modern enterprises. Today, organizations are heavily investing in LLMs through subscriptions or API integrations, while simultaneously sitting on a rich legacy of traditional AI and ML models, ranging from recommendation engines to risk scoring systems. Instead of fine-tuning large, costly LLMs or rebuilding existing solutions from scratch, companies can adopt this hybrid approach to maximize value. By wrapping their traditional models as callable tools within LLM-powered agents, they can create intelligent systems that combine domain-specific insights with the natural language reasoning of GenAI, accelerating innovation while preserving past investments.

Integrating the traditional model with GenAI

As the field of AI transitions into the era of GenAI, the challenge and opportunity lie in bridging traditional AI models with the emergent capabilities of LLMs. Enterprises often possess a portfolio of pre-existing ML and deep learning models designed for specific predictive or perceptual tasks. Rather than discarding or fine-tuning LLMs for these use cases, a more modular and cost-effective approach involves wrapping traditional models as callable tools and orchestrating them via LLM-based agents. This enables intelligent systems where LLMs serve as the reasoning layer, while traditional models perform high-accuracy predictive tasks.

This section explores how various traditional AI/ML models, from classifiers and regressors to convolutional neural networks (CNNs) and optical character recognition (OCR), can be seamlessly integrated into GenAI workflows using tool-augmented agents. By exposing models via APIs and enabling LLMs to interpret and act on their outputs, developers can build intelligent systems that combine predictive accuracy with natural language interaction, details as follows:

  • Classification models: It is used to assign labels such as fraud/not-fraud or category membership. These models are essential for decision-making tasks.
    • Example: Fraud detection using logistic regression or XGBoost.
      • API exposure: The classifier is deployed using FastAPI or Flask to expose an endpoint like /predict_class.
      • Tool calling: A LangChain or custom LLM agent defines a tool schema that sends feature vectors to this endpoint.
      • LLM reasoning: The LLM interprets user input (is this transaction fraudulent?), transforms it into structured input, and explains the output in human-readable terms (this transaction has an 87% probability of being fraud.).
      • Agent utility: The agent can trigger follow-up actions, e.g., notifying a compliance officer or logging the transaction, based on classification results.
  • Regression models: Used for continuous value prediction, these models help estimate outcomes like prices, scores, or risks.
    • Example: House price prediction using linear regression or LightGBM.
      • API exposure: Model is served via an endpoint like /predict_price, accepting numerical features (location, size, year).
      • Tool calling: The tool sends structured inputs and retrieves predicted values.
      • LLM reasoning: The LLM provides interpretability (based on area and amenities, the predicted value is 95 lakhs.)
      • Agent utility: The output can feed into downstream tools for loan eligibility estimation or investment recommendation.
  • Forecasting models: These time-series models enable predictive planning by estimating future values based on historical trends.
    • Example: Time-series sales forecasting using Autoregressive Integrated Moving Average (ARIMA) or Prophet.
      • API exposure: A REST endpoint like /forecast_sales takes historical data and outputs future values.
      • Tool calling: The tool format may involve nested time-series arrays.
      • LLM reasoning: The LLM interprets temporal trends and explains seasonal effects (sales are expected to spike during Diwali due to past patterns.).
      • Agent utility: It can be used to trigger inventory reordering or dynamic pricing recommendations in real-time.
  • Artificial neural networks: Used for capturing non-linear relationships in structured data, artificial neural networks (ANNs) power many classification and prediction tasks.
    • Example: Customer churn prediction using a feedforward neural network.
      • API exposure: Wrapped via a serving framework like TensorFlow Serving or TorchServe.
      • Tool calling: The LLM agent formats customer attributes and invokes the /predict_churn endpoint.
      • LLM reasoning: The LLM explains non-linear relationships captured by the ANN (high tenure and usage suggest low churn risk).
      • Agent utility: The agent can segment customers and prioritize retention offers or emails based on churn risk.
  • CNNs: Primarily used in image classification tasks, CNNs are key for visual inspection, defect detection, or recognition.
    • Example: Image classification (e.g., defect detection on products)
      • API exposure: CNN model is served via an endpoint like /classify_image, which accepts base64 or file URL formats.
      • Tool calling: The agent preprocesses the image input, sends it to the CNN, and receives a label.
      • LLM reasoning: The LLM contextualizes the label (the crack detected suggests mechanical failure.).
      • Agent utility: Based on the classification, the agent can initiate a quality check workflow or notify operations.
  • Segmentation models: These models divide images into meaningful parts, useful for medical imaging, document layout analysis, or object detection.
    • Example: Semantic segmentation in medical imaging (e.g., tumor detection in MRI scans).
      • API exposure: The model is served with endpoints like /segment_image, returning a segmented mask or overlay.
      • Tool calling: The agent sends image data and processes returned masks.
      • LLM reasoning: The LLM can interpret the segmentation, and the highlighted region corresponds to a likely tumor boundary.
      • Agent utility: Enables automatic report generation or decision support in clinical systems by combining LLM narrative generation with pixel-level model outputs.

  • OCR models: It is used to convert printed or handwritten text from images into a machine-readable format; these models unlock structured data from visual documents.
    • Example: Text extraction from scanned documents, invoices, or identity cards using Tesseract, EasyOCR, or a custom transformer-based OCR pipeline.
      • API exposure: The OCR model is deployed via an endpoint such as /extract_text, which accepts image files (e.g., PNG, JPEG) and returns recognized text in JSON format.
      • Tool calling: The GenAI agent invokes the tool with the uploaded image, receives extracted text, and optionally post-processes for structure (e.g., key-value pairs).
      • LLM reasoning: The LLM interprets the raw OCR output, this document is an invoice from Vendor X dated March 12, 2024, with a total payable amount of $1,274.
      • Agent utility: It enables downstream tasks such as document classification, automated data entry, or conversational querying (what is the invoice number?) by combining OCR output with LLM-based natural language understanding.

        In Chapter 16, GenAI for Extracting Text from Images, we explored how to integrate OCR capabilities with GenAI by wrapping OCR models as callable tools.
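The tool-calling pattern shared by all of the model types above can be condensed into a minimal, dependency-free sketch. The stub classifier, the tool name, and the threshold logic are illustrative only, standing in for a real model behind a deployed /predict_class endpoint:

```python
# Stub for the deployed fraud classifier; in production this would POST the
# feature vector to the /predict_class endpoint instead of computing locally.
def predict_class(features: dict) -> dict:
    score = 0.87 if features.get("amount", 0) > 10000 else 0.12
    return {"label": "fraud" if score > 0.5 else "not-fraud", "probability": score}

# Tool registry the agent dispatches against, mirroring a LangChain tool schema.
TOOLS = {"predict_class": predict_class}

def agent_answer(tool_name: str, payload: dict) -> str:
    result = TOOLS[tool_name](payload)
    # The LLM's role: translate structured output into a human-readable reply.
    return (f"This transaction has a {result['probability']:.0%} probability "
            f"of being {result['label']}.")

reply = agent_answer("predict_class", {"amount": 25000, "hour": 3})
```

The same registry-plus-dispatch shape extends naturally to /predict_price, /forecast_sales, /classify_image, and the other endpoints listed above.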

Initialization of these hybrid systems

In a hybrid GenAI/ML system, traditional ML processes can be triggered either interactively via user instructions to the LLM or automatically through batch workflows orchestrated by the agent. When users engage directly, they issue natural language queries such as can you predict the churn risk for this customer? or extract text from this receipt and summarize the key details. The LLM interprets the intent, structures the required inputs, and calls the corresponding ML tool, such as a churn prediction model or an OCR service, via predefined API wrappers.

Alternatively, in batch or background processing, the LLM agent may iterate over a queue of tasks (e.g., daily image folders, transaction logs) and autonomously invoke traditional models. For instance, a scheduled agent may analyze all uploaded invoices every night using OCR and pass the extracted data to a financial anomaly detector. These operations are initialized through LangChain-like orchestration layers or microservice pipelines that monitor triggers or workflows and coordinate tool invocations accordingly.
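The nightly batch pattern can be sketched as a simple loop over a task queue; the paths and the ocr_tool stub are illustrative placeholders for the real OCR service:

```python
# Stub OCR tool standing in for the deployed /extract_text endpoint.
def ocr_tool(image_path: str) -> str:
    return f"extracted text from {image_path}"

# Queue of documents uploaded during the day (hypothetical paths).
task_queue = ["invoices/inv_001.png", "invoices/inv_002.png"]

# The scheduled agent walks the queue and invokes the tool on each item;
# the results would then be handed to the financial anomaly detector.
extracted = {path: ocr_tool(path) for path in task_queue}
```

In a real deployment, the loop body would be the orchestration layer's tool invocation, with retries and logging around each call.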

This design supports both on-demand reasoning and automated ML execution, allowing organizations to combine GenAI's flexibility with the precision of legacy models, enabling applications like fraud detection, recommendation engines, and customer analytics with minimal manual intervention.

Use case

This is a case study of hybrid ensemble learning for telecom fraud detection. Fraud detection remains a critical challenge in the telecommunications industry, where fraudulent activities, such as identity spoofing, Subscriber Identity Module (SIM) cloning, and illegitimate claim submissions, pose substantial risks to revenue and customer trust. The rarity of fraudulent instances compared to legitimate transactions results in highly imbalanced datasets, making conventional classification methods inadequate. This case study presents an ensemble-based ML approach, augmented by deep learning methods, to detect fraud in a real-world telecom claims dataset characterized by a 6:94 fraud-to-non-fraud ratio.

Data characteristics and preprocessing

The dataset comprised anonymized claim-related features, including customer metadata, claim submission intervals, and verification flags. Notably, variables such as IS_MISSING_MOBILE, HOUR_TO_RAISE_CLAIM, and TOTAL_VERIFICATIONS carried domain-specific semantic value and were not imputed using statistical means. Instead, such features were encoded using flag-based approaches to preserve interpretability. Categorical features were label-encoded, and numerical attributes were standardized using Z-score normalization. Missing values and zero-inflated features were visualized and handled explicitly to ensure robust downstream model behavior.

Baseline model development and evaluation

An initial XGBoost model was developed, incorporating the scale_pos_weight parameter to address class imbalance. Instead of relying on the default decision threshold of 0.5, a threshold tuning mechanism was applied. Precision, recall, and F1 scores were computed across multiple thresholds, and the optimal cutoff was selected to maximize the F1 score, achieving a trade-off between fraud detection (recall) and false alarm reduction (precision).
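The threshold-tuning step can be sketched as follows. This is an illustrative example, not the chapter's code: the toy labels and scores stand in for the model's validation outputs, and candidate cutoffs are scanned for the one that maximizes F1.

```python
# Sketch of threshold tuning: instead of the default 0.5 cutoff, compute F1
# across many candidate thresholds and keep the best one.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])          # toy labels
y_prob = np.array([0.10, 0.20, 0.30, 0.35, 0.80,
                   0.40, 0.70, 0.90, 0.15, 0.60])           # toy model scores

candidates = np.arange(0.05, 0.95, 0.05)
f1_scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in candidates]
best = candidates[int(np.argmax(f1_scores))]

y_pred = (y_prob >= best).astype(int)
print(f"best threshold={best:.2f} "
      f"precision={precision_score(y_true, y_pred):.2f} "
      f"recall={recall_score(y_true, y_pred):.2f}")
```

In the real pipeline the same scan runs over a held-out validation set, trading off fraud recall against false-alarm precision exactly as described above.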

Performance was evaluated using standard classification metrics, including the confusion matrix, precision-recall (PR) curve, receiver operating characteristic (ROC) curve, Matthews correlation coefficient (MCC), and Cohen’s Kappa score. This multi-metric evaluation provided a comprehensive view of model reliability under imbalance conditions.

Stacked ensemble learning approach

To further improve generalization and model robustness, a stacked ensemble classifier was constructed. The base learners included XGBoost, LightGBM, and the gradient boosting classifier. Their individual probability outputs were passed to a meta-classifier, logistic regression, which learned to optimally combine their outputs. The ensemble was trained using a stratified train-test split and evaluated on the same metrics as the baseline model.
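A minimal runnable sketch of this stacking architecture is shown below. The chapter's base learners are XGBoost and LightGBM; here, to keep the example dependency-free, scikit-learn estimators stand in for them, while the meta-classifier (logistic regression) still learns to combine the base learners' probability outputs.

```python
# Sketch of a stacked ensemble on synthetic imbalanced data (~6:94 ratio).
# GradientBoosting/RandomForest substitute for XGBoost/LightGBM in this sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("rf", RandomForestClassifier(class_weight="balanced", random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # pass probability outputs to the meta-model
    cv=3,                          # stratified folds for out-of-fold stacking
)
stack.fit(X_tr, y_tr)
proba = stack.predict_proba(X_te)[:, 1]  # fraud probabilities for thresholding
print(proba.shape)
```

The probability outputs can then be fed into the same threshold-tuning procedure used for the baseline model.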

The stacked ensemble demonstrated superior performance compared to any single model. It yielded higher recall for fraud detection while maintaining competitive precision, thus minimizing both false negatives and false positives. The ROC-AUC and PR-AUC scores improved notably, and the MCC and Kappa values confirmed increased model stability.

The study underscores the efficacy of combining tree-based classifiers in a stacked architecture for fraud detection in highly imbalanced datasets. Moreover, threshold optimization and domain-informed preprocessing were essential for improving real-world applicability. The proposed approach can be integrated into production systems for fraud risk scoring and supports extensibility for SHAP-based interpretability or real-time fraud monitoring APIs. If we use the above ensemble and/or the XGBoost model (as described in the fraud detection case study) in conjunction with an LLM, the LLM serves as a reasoning, orchestration, and explanation layer around the already-trained predictive model.

Purpose of the LLM in this setup

The following is a breakdown of the roles and purposes the LLM would serve in a hybrid system:

  • Natural language interface:
    • Purpose: It allows non-technical users (e.g., fraud analysts, claim reviewers) to interact with the XGBoost model using natural language.
    • Example input: Can you check if this claim is fraudulent? The mobile number is missing, and the claim was raised late at night.
    • LLM action: Parse this input, extract relevant features from the data (IS_MISSING_MOBILE=1, HOUR_TO_RAISE_CLAIM=2), and send them to the XGBoost model via a tool/API.
    • Tool invocation/model wrapping:
      • Purpose: It acts as a controller that decides when and how to invoke the XGBoost model wrapped as a callable tool (e.g., via LangChain or FastAPI).
      • Mechanism:

        Tool(
            name="FraudScoringTool",
            func=call_xgboost_api,
            description="Predicts fraud probability for a telecom claim."
        )

    The LLM calls this tool internally when reasoning about fraud.

  • Result interpretation and explanation:
    • Purpose: It translates the raw output (e.g., fraud probability = 0.92) into human-understandable language, often with context.
    • Example output:
      • This claim has a high fraud probability (92%). The model flagged it due to missing mobile data, off-hours claim submission, and low verification count.
      • This is especially valuable when used with explainability tools like SHAP, where the LLM can narrate the most influential features in plain English.
  • Chained decision-making/workflow triggering:
    • Purpose: Based on the model result, the LLM agent can decide the next step:
      • Flag for manual review
      • Auto-reject the claim
      • Ask the user for additional evidence

This is akin to CoT reasoning, augmented by model outputs.

  • Audit trail/report generation:
    • Purpose: Generate structured summaries of the fraud detection process, combining model scores, explanations, and reasoning.
    • Example: Claim ID #4532 has been flagged for review. Fraud score: 0.92. Key factors: unverified contact info, delayed submission, and missing address fields. Analyst action recommended.

Wrapping the XGBoost model into an LLM

The existing XGBoost pipeline performs highly refined fraud classification. The code can be found in the GitHub repository, featuring threshold tuning, feature selection, and visualization. To augment this with an LLM, we introduce an intelligent reasoning layer. First, the LLM acts as a natural language interface, allowing users to ask, is this claim likely to be fraudulent? The LLM parses user queries, extracts structured features (e.g., IS_MISSING_MOBILE, HOUR_TO_RAISE_CLAIM), and invokes the XGBoost model via an API wrapper or LangChain tool.

The following architecture illustrates an integrated system that combines a traditional fraud detection pipeline using an XGBoost model with a modern GenAI-based chat interface. The upper section outlines the data ingestion process, where transaction and demographic data are preprocessed using pandas and subsequently used to train an XGBoost model via scikit-learn. This trained fraud detection model is exposed through a FastAPI interface. In the lower section, a user interacts with a LangChain-powered conversational agent that leverages the fraud model as a tool. The agent performs reasoning with the support of an Ollama-hosted LLM (e.g., Mistral) to generate contextual responses.

Flowchart showing transaction and demographic data entering a pipeline comprising sklearn, FastAPI, and LangChain, producing a fraud detection model and responses generated with GenAI (Ollama).

Figure 17.1: Architecture diagram of swapping a traditional model with GenAI

Next, the LLM performs orchestration, determining when to trigger predictions, re-run threshold tuning, or generate SHapley Additive exPlanations (SHAP) values. For example, if a user asks why was this claim flagged? the LLM interprets the model output and can request the feature importance plot or call a SHAP explainer module.

The LLM provides explanation, converting numerical predictions and thresholds into human-readable reasoning:

This claim has a 91% fraud probability based on rapid submission and missing mobile details. It crosses the optimal F1 threshold of 0.55.

Thus, the LLM transforms a technical ML pipeline into an accessible, explainable, and interactive fraud detection system usable by analysts and decision-makers without direct coding expertise.

The requirements.txt file, as shown in Figure 17.2, specifies all necessary dependencies for building, training, serving, and orchestrating the hybrid LLM-XGBoost fraud detection system. It includes core ML libraries such as XGBoost, scikit-learn, and Pandas for model development and preprocessing, as well as FastAPI and Uvicorn for RESTful API serving. Dependencies like LangChain and Ollama enable natural language tool-based reasoning through a local LLM backend. This unified specification ensures that the project can be set up consistently across environments and supports reproducible experimentation, LLM-driven inference workflows, and scalable production deployment with minimal configuration overhead. Run pip install -r requirements.txt to install all dependencies.

Screenshot of the requirements.txt file listing Python packages with pinned versions, including xgboost, scikit-learn, pandas, numpy, joblib, fastapi, uvicorn, langchain, ollama, requests, matplotlib, and seaborn.

Figure 17.2: Requirements and dependencies for the hybrid project

Run order

To set up and run the complete pipeline, follow these steps in order, starting from model training to launching the API and executing the GenAI agent:

1. Train the model: python model/train_xgb_model.py

2. Start the FastAPI server: uvicorn api.fraud_model_api:app --reload --port 8000

3. Run the LLM agent with LangChain + Ollama: python agent/run_agent.py

To ensure modularity, maintainability, and ease of deployment, this project adopts a clean, layered folder structure that separates model training, API serving, LLM orchestration, and utility logic, as shown in the following figure. Each component, like the data preprocessing, XGBoost modeling, FastAPI integration, and LangChain tool wrapping, is isolated in its own directory, promoting scalability and clarity. The model folder contains all artifacts necessary for downstream inference, while API exposes these assets as REST endpoints. The tools and agent layers enable natural language interaction with structured ML predictions via Ollama-powered reasoning agents. This structure supports both iterative development and seamless transition to production-grade systems.

Screenshot of the fraud detection ML project's file directory tree, showing folders and files for data, model, API, tools, the LLM agent, utilities, and evaluation plot images.

Figure 17.3: Folder structure for the hybrid project

To save your trained XGBoost model, along with other necessary components like selected features, scaler, and label-encoders, you can use joblib (recommended for large models due to better performance over pickle).
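A minimal sketch of the joblib save/load round trip. This is an illustration only: an unfitted scaler and a short feature list stand in for the trained artifacts, and a temporary directory replaces the project's model/ folder.

```python
# Persist pipeline artifacts with joblib and load them back for inference.
import os
import tempfile

import joblib
from sklearn.preprocessing import StandardScaler

# Dummy stand-ins; in the real pipeline these are the fitted model, scaler,
# label encoders, and the RFE-selected feature list.
scaler = StandardScaler()
selected_features = ["IS_MISSING_MOBILE", "HOUR_TO_RAISE_CLAIM"]

model_dir = tempfile.mkdtemp()  # the project writes to the model/ directory
joblib.dump(scaler, os.path.join(model_dir, "scaler.pkl"))
joblib.dump(selected_features, os.path.join(model_dir, "selected_features.pkl"))

# Later, at inference time, load the artifacts back:
restored = joblib.load(os.path.join(model_dir, "selected_features.pkl"))
print(restored)
```

The same `joblib.dump`/`joblib.load` calls apply to the trained XGBoost model and label encoders.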

The following figure outlines the end-to-end workflow for integrating a traditional XGBoost model into a GenAI agent using FastAPI, LangChain, and Ollama:

Flowchart with six steps, with arrows pointing left to right: train the XGBoost model, save the artifacts (model, encoders, features), serve the model via FastAPI, tool calls to FastAPI, LangChain tool wrapping, and the LLM agent using Ollama.

Figure 17.4: End-to-end pipeline for wrapping an XGBoost model into a GenAI

Code implementation

This implementation presents a modular hybrid system where a traditional XGBoost classifier is exposed through a FastAPI service and orchestrated by a GenAI agent using LangChain. The use case is based on telecom fraud detection. The system highlights how existing ML pipelines can be integrated into agentic workflows for enhanced interpretability and usability.

Model training pipeline

The ML backend is built using an XGBoost classifier. The script train_xgb_model.py performs the following sequential steps:

1. Data preparation: The dataset is loaded from data/dummy_test_vif_filtered_imputed_cleaned.csv, and categorical features are label-encoded while numerical features are standardized using StandardScaler.

2. Feature selection: Recursive feature elimination (RFE) selects the top 10 most predictive features.

3. Model training: A class-weighted XGBoost model is trained using these features to handle imbalanced fraud data.

4. Model evaluation: Performance metrics such as precision, recall, F1 score, MCC, ROC, and PR curves are plotted. Threshold tuning is performed to identify the optimal decision boundary.

5. Model saving: The trained model and its associated preprocessing objects (scaler, label_encoders, selected_features) are saved using joblib into the model/ directory.
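Steps 1 and 2 can be sketched as follows. This is a minimal illustration on synthetic data: LogisticRegression stands in for the chapter's XGBoost estimator inside RFE so the snippet needs only scikit-learn, and the synthetic columns mirror feature names used in this chapter.

```python
# Condensed sketch of data preparation (encode + scale) and RFE selection.
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "HOUR_TO_RAISE_CLAIM": rng.integers(0, 24, 200),
    "TOTAL_VERIFICATIONS": rng.integers(0, 5, 200),
    "REGION": rng.choice(["north", "south"], 200),           # categorical
    "IS_FRAUD": rng.choice([0, 1], 200, p=[0.94, 0.06]),     # ~6:94 imbalance
})

# Step 1: label-encode categoricals, standardize numericals.
df["REGION"] = LabelEncoder().fit_transform(df["REGION"])
X = StandardScaler().fit_transform(df.drop(columns="IS_FRAUD"))
y = df["IS_FRAUD"].to_numpy()

# Step 2: recursive feature elimination keeps the top predictors
# (n_features_to_select=10 in the chapter's pipeline; 2 of 3 here).
rfe = RFE(LogisticRegression(class_weight="balanced", max_iter=1000),
          n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of retained features
```

In the real pipeline, the class-weighted XGBoost model (via `scale_pos_weight`) is then trained on the retained features.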

Figure 17.5 shows that the training pipeline successfully completed, producing a high-performing XGBoost model for fraud detection. After encoding and scaling, the model underwent RFE to retain the most informative predictors. A threshold tuning phase revealed that a decision threshold of 0.70 maximized the F1 score. At this threshold, the classifier achieved an overall accuracy of 89%, with a precision of 0.25 and a recall of 0.44 for the fraud class. Evaluation metrics such as MCC (0.270) and Cohen’s Kappa (0.257) indicate moderate agreement, confirming the model’s effectiveness in handling class imbalance while minimizing false positives and false negatives.

Terminal screenshot showing the ML pipeline output, including model training steps, feature selection, threshold tuning results, classification report metrics, and evaluation scores such as MCC and Cohen's Kappa.

Figure 17.5: Successful traditional model training completion

The training process will also generate these files under the model/ directory:

  • xgb_model_final.pkl
  • selected_features.pkl
  • scaler.pkl
  • label_encoders.pkl

The xgb_model_final.pkl file contains the trained XGBoost classifier optimized for fraud detection. It is the core predictive engine used by the API and the GenAI agent. The selected_features.pkl stores the top 10 features identified through RFE, ensuring only the most relevant inputs are used during inference. The scaler.pkl holds a StandardScaler object used to normalize numerical input features for consistency with training.

Lastly, label_encoders.pkl contains LabelEncoder objects for transforming categorical input features into numerical form, preserving the encoding logic used during model training for reliable real-time predictions.

FastAPI serving layer

The trained XGBoost model is served via FastAPI in fraud_model_api.py. Key components include the following:

  • Model artifact loading: At startup, the service checks for and loads all required artifacts.
  • Schema definition: The API uses pydantic.BaseModel to validate incoming requests.
  • Prediction endpoint: The /predict_fraud endpoint takes structured claim features, preprocesses them using the saved scalers and encoders, and returns the predicted fraud probability.

As shown in the following figure, this layer also includes cross-origin resource sharing (CORS) middleware to facilitate future frontend integrations:

Terminal window showing INFO logs: watching for directory changes, the Uvicorn server running at http://127.0.0.1:8000, the reloader process started, and a warning that Pandas 1.3.6+ and the bottleneck library are required.

Figure 17.6: The figure shows the reloader process [22380] started using WatchFiles

Tool wrapper for FastAPI inference

The fraud_tool.py file defines a utility function call_fraud_model(features: dict), which serves as a tool wrapper:

  • It sends a POST request to the FastAPI server with the input features.
  • It interprets the returned fraud probability and adds a natural language explanation based on input conditions such as time of claim and missing mobile information.
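A hedged sketch of what such a wrapper might look like. The URL, field names, and explanation logic are illustrative assumptions rather than the repository's exact code, and the `post` argument is injectable only so the sketch can be exercised without a running server.

```python
# Sketch of a fraud_tool.py-style wrapper: POST the structured features to
# the FastAPI service and turn the returned probability into a short
# natural-language explanation.
import requests

API_URL = "http://127.0.0.1:8000/predict_fraud"  # assumed local endpoint

def call_fraud_model(features, post=requests.post):
    """POST features to the model API and explain the returned score."""
    response = post(API_URL, json=features, timeout=10)
    prob = response.json()["fraud_probability"]

    reasons = []
    if features.get("IS_MISSING_MOBILE"):
        reasons.append("missing mobile number")
    if features.get("HOUR_TO_RAISE_CLAIM", 12) < 6:
        reasons.append("claim raised during off-hours")

    detail = f" Contributing signals: {', '.join(reasons)}." if reasons else ""
    return f"Fraud probability is {prob:.0%}.{detail}"
```

The LangChain tool registered in the next section simply exposes this function to the agent.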

The following figure shows FastAPI running on port 8080:

Screenshot of the fraud detection API documentation interface showing the POST endpoint /predict_fraud and expandable schema sections: ClaimFeatures, HTTPValidationError, and ValidationError.
Screenshot of the terminal running the code, showing Python commands, pip installing dependencies, and output messages mentioning compatible Python versions and modules such as libstdc++ and distutils.

Figure 17.7: The figure shows that the FastAPI server is up and running on port 8080

LangChain tool registration

In langchain_fraud_tool.py, the preceding wrapper is exposed as a LangChain-compatible tool:

from langchain_core.tools import Tool
from tools.fraud_tool import call_fraud_model

fraud_detection_tool = Tool(
    name="FraudDetectionTool",
    func=call_fraud_model,
    description="Use this to check if a telecom claim is likely fraudulent. Provide structured features like IS_MISSING_MOBILE, HOUR_TO_RAISE_CLAIM, and TOTAL_VERIFICATIONS."
)

This tool enables the GenAI agent to invoke the model as part of its decision-making process.

Agent orchestration with Mistral via Ollama

The script run_agent.py implements a LangChain agent that does the following:

  • Loads the local mistral model using Ollama.
  • Initializes an agent with the FraudDetectionTool.
  • Submits a natural language query describing a claim.
  • The agent parses the query, generates structured input, calls the tool, and returns the result.

Critically, handle_parsing_errors=True is used to allow the agent to recover from ambiguous LLM output, ensuring robustness during reasoning cycles:

agent = initialize_agent(
    tools=[fraud_detection_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True
)

The response is printed to the terminal, showing the interpreted output and fraud prediction:

Computer screen showing code output that highlights detection of a fraudulent claim using features such as MISSING_NKAM, NUM_OF_DAYS_CLAIM, and TOTAL_VERIFICATIONS, and recommends validating the claim with the FraudDetectionTool.

Figure 17.8: Agent output from LLM

After receiving a user instruction, the LLM interprets the natural language query to identify that a fraud check is requested. It extracts relevant features from the query, formats them into a structured JSON payload, and invokes a tool that sends this input to the FastAPI service hosting the XGBoost model. Upon receiving the fraud probability score, the LLM interprets the result and generates a human-readable explanation based on known feature importances (e.g., missing mobile or odd submission hours). Finally, it returns a clear, conversational response to the user, optionally suggesting next actions like review or rejection.

The end-to-end code can be found in the GitHub repository.

Comparative overview of ML model integration in GenAI workflows

In a hybrid GenAI/ML system, integrating traditional models like CNNs, segmentation models, ANNs, and OCR differs in terms of data types, model architecture, deployment complexity, and interaction with LLMs. The following is a comparative overview of how their implementation and integration may differ:

| Model type | Use case | Implementation | Serving strategy | LLM integration |
| --- | --- | --- | --- | --- |
| ANNs | Structured/tabular tasks (e.g., churn prediction, risk scoring) | Preprocessing of numerical and categorical features (scaling, encoding); resembles XGBoost pipelines | Wrapped as APIs that take vectors and return predictions | LLM sends the input vector, receives the prediction, and explains the result in natural language |
| CNNs | Image-based tasks (e.g., classification, defect detection) | Image preprocessing (resizing, normalization); trained on labeled image datasets | Served via REST APIs accepting image files (base64 or URLs); returns labels/probabilities | LLM encodes the user query into an image upload plus metadata, invokes the CNN, and interprets the result (e.g., defect detected) |
| Segmentation models | Pixel-wise classification (e.g., medical imaging, satellite data) | Outputs segmentation masks; often requires GPU-backed serving | Served via TorchServe/TF Serving with GPU; returns overlays/masks | LLM sends the image plus context, receives the mask, and explains segmented regions (e.g., tumor boundary) |
| OCR | Text extraction from images (e.g., receipts, documents) | Uses tools like Tesseract or EasyOCR to extract unstructured text | Served as a tool/API returning raw text from image input | LLM combines OCR output with semantic reasoning (e.g., what is the invoice amount?) |

Table 17.1: Comparative overview of ML model integration in GenAI workflows

To do

As a practical extension of this chapter, your task is to build a LangChain agent that interfaces with a graph-based recommendation model, commonly used in scenarios like product recommendation, social network suggestions, or content discovery. Begin by selecting or implementing a recommendation model that uses graph data structures, such as node embeddings from Node2Vec, Personalized PageRank, or a graph neural network (GNN). The model should expose a function or API endpoint that accepts a user ID or item ID and returns a ranked list of recommended nodes.

Next, wrap this function or API into a LangChain tool by defining a description, expected input schema, and output behavior. Then, use a local LLM (e.g., Mistral via Ollama) to create a LangChain agent that can interpret natural language instructions like suggest products similar to this item or what should user 123 watch next? The LLM should parse the intent, extract the user or item ID, call the graph recommendation tool, and explain the output in plain English. This task reinforces key concepts from the chapter (tool wrapping, agent orchestration, and reasoning) while applying them to a new but complementary domain of graph-based AI systems.

Conclusion

In this chapter, we demonstrated how to build a hybrid AI system that combines the predictive power of a traditional XGBoost model with the reasoning and language capabilities of an LLM like Mistral via Ollama. We began by implementing a robust fraud detection pipeline using XGBoost, incorporating class imbalance handling, feature selection, threshold tuning, and performance evaluation. The trained model and preprocessing components were saved using joblib for downstream inference.

Next, we deployed the model as a REST API using FastAPI, enabling real-time predictions. We then constructed a LangChain-compatible tool that calls this API and wrapped it into a reasoning agent powered by a locally hosted LLM. This agent receives natural language queries, extracts structured features, invokes the XGBoost model, interprets the result using precomputed feature importances, and delivers human-readable explanations and recommendations.

We also defined a clear project folder structure, a complete requirements.txt for reproducibility, and a process flowchart. The result is a modular, explainable, and scalable AI system where traditional ML and GenAI collaborate to provide intelligent fraud detection and decision support in real-world applications.

In the next chapter, we will cover LLM operations (LLMOps) and GenAI evaluation techniques.

Join our Discord space

Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

https://discord.bpbonline.com

C第18LLM操作和GenAI评估技术

CHAPTER 18: LLM Operations and GenAI Evaluation Techniques

Introduction

This is the final chapter after implementing and understanding numerous generative AI (GenAI) systems across diverse domains, from retrieval-augmented generation (RAG) and agent orchestration to multimodal pipelines and optimization frameworks. In this concluding section, we will now shift our focus to the operational and evaluative backbone that makes these intelligent systems reliable, scalable, and production-ready. This chapter delves into large language model operations (LLMOps) and RAGOps, a critical set of practices, tools, and design principles for managing the lifecycle of LLM-based applications in real-world settings. You will explore topics such as deployment, monitoring, observability, versioning, and adaptive feedback loops for RAG pipelines, as well as strategies to ensure resilience, traceability, and governance in LLM-driven products.

Alongside operational excellence, we turn to GenAI evaluation techniques, which are essential for measuring quality, relevance, accuracy, and user alignment. Traditional metrics often fall short in capturing the nuanced performance of generative systems, so we introduce both automatic and human-in-the-loop (HITL) evaluation strategies. This includes scoring mechanisms as well as modern alignment metrics, hallucination detection, and model-grounded evaluation frameworks.

Together, these operational and evaluative foundations enable you to confidently move from experimentation to enterprise-grade deployment in the era of GenAI.

结构

Structure

本章我们将学习以下主题:

In this chapter, we will learn about the following topics:

  • 生产级 GenAI 应用中运维的重要性
  • Importance of Ops in production-grade GenAI applications
  • 比较LLM和RAG评估
  • Comparing LLM and RAG evaluations
  • RAGOps
  • RAGOps
  • 持续监测
  • Continuous monitoring
  • 可观测性平台
  • Observability platforms
  • 基于图增强的 RAG 推荐系统
  • Graph-enhanced RAG-based recommendation system
  • 现代软件开发中各种操作的比较
  • Comparison of various Ops in modern software development
  • 安装MLflow
  • Installation of MLflow

目标

Objectives

本章旨在介绍和阐述 RAGOps,这是一种用于在实际 GenAI 部署中实现 RAG 系统运维的结构化方法。本章探讨了运维在生产级 GenAI 应用中的重要性,区分了 LLM 和 RAG 评估方法,并强调了两者如何支持系统的持续可观测性和可靠性。读者将了解如何在开发和部署后阶段实施 RAGOps,如何利用核心可观测性平台,以及如何将这些概念应用于基于图增强的 RAG 推荐系统。一个实践练习将指导读者完成端到端的实施,并强调可扩展 GenAI 系统中可追溯性、监控和评估的必要性。

The objective of this chapter is to introduce and conceptualize RAGOps, a structured approach to operationalizing RAG systems in real-world GenAI deployments. It explores the significance of Ops in production-grade GenAI applications, differentiates between LLM and RAG evaluation methodologies, and emphasizes how both support continuous system observability and reliability. Readers will understand how to implement RAGOps during development and post-deployment phases, utilize core observability platforms, and apply these concepts in a graph-enhanced RAG-based recommendation system. A practical exercise guides readers through end-to-end implementation, reinforcing the need for traceability, monitoring, and evaluation in scalable GenAI systems.

生产级 GenAI 应用中运维的重要性

Importance of Ops in production-grade GenAI applications

考虑一个实际应用案例:一个面向流媒体平台的个性化内容推荐系统,该系统采用基于图的模型,并结合LLM生成的摘要和基于RAG的用户查询解释。最初,该系统在实验室环境下运行良好,在受控输入测试中能够返回相关内容。然而,一旦部署到生产环境,就会出现诸多挑战,而运维(LLMOps和RAGOps)正是在此发挥关键作用。

Consider a real-world application (use case illustration): A personalized content recommendation system for a streaming platform that uses a graph-based model enriched with LLM-generated summaries and RAG-based user query interpretation. Initially, the system works well in the lab, returning relevant content when tested with controlled inputs. However, once deployed in production, several challenges emerge, and this is where Ops (LLMOps and RAGOps) becomes critical.

例如,随着用户流量的增加,由于数据检索链过长或应用程序编程接口( API ) 的速率限制,模型的延迟也会增加。如果没有适当的监控,这种性能下降可能难以察觉,从而降低用户体验( UX )。运维实践允许您设置延迟和吞吐量监控,以便在用户受到影响之前向工程团队发出异常警报。

For instance, as user traffic increases, the model's latency grows due to long retrieval chains or application programming interface (API) rate limits. Without proper monitoring, this slowdown could go unnoticed, degrading the user experience (UX). Ops practices allow you to set up latency and throughput monitoring, alerting the engineering team to anomalies before users are affected.

此外,用户行为会随时间推移而改变;新的音乐类型、俚语或热门话题可能会降低预训练词嵌入或图连接的相关性。如果没有自适应重训练流程或反馈感知索引,推荐结果就会过时。RAGOps 通过定期更新或实时反馈循环来确保向量存储和知识库定期刷新。

Additionally, user behavior may shift over time; new genres, slang, or trending topics could reduce the relevance of the pre-trained embeddings or graph connections. Without adaptive retraining pipelines or feedback-aware indexing, the recommendations will become stale. RAGOps ensures the vector store and knowledge base are refreshed regularly, either through scheduled updates or real-time feedback loops.

现在,想象一下幻觉突然激增,导致LLM生成不准确或无关的摘要。借助完善的评估和日志记录系统,运维团队能够识别模型响应何时偏离预期行为,并触发备用机制或标记以供审查。

Now, imagine a sudden spike in hallucinations, where the LLM generates inaccurate or irrelevant summaries. With a robust evaluation and logging system in place, Ops helps identify when model responses deviate from expected behavior and triggers a fallback mechanism or flags for review.

此外,版本控制和回滚机制至关重要。如果新的模型或图表更新导致质量下降,运维团队可以快速回滚到稳定版本,而不会中断整个系统。

Moreover, versioning and rollback mechanisms are essential. If a new model or graph update causes quality to drop, Ops allows teams to quickly revert to a stable version without disrupting the entire system.

RAGOps 的另一个关键方面是跟踪和管理嵌入。在生产系统中,嵌入不仅代表静态内容,还代表应用程序不断演进的知识库。文档的更改、用户偏好设置,甚至 LLM 的更新都会影响嵌入的质量和相关性。如果没有嵌入版本控制和日志记录,就很难追踪检索失败的原因或 LLM 生成离题响应的原因。运维实践支持嵌入元数据日志记录、时间戳和集合版本控制,使团队能够审核特定查询使用了哪些嵌入、它们的生成时间以及它们是否与最新内容一致。这种可追溯性对于调试、合规性和检索层的持续改进至关重要。

Another critical aspect of RAGOps is tracking and managing embeddings. In production systems, embeddings represent not just static content, but the evolving knowledge base of your application. Changes to documents, user preferences, or even updates in the LLM can affect embedding quality and relevance. Without embedding version control and logging, it is difficult to trace why retrievals are failing or why the LLM is generating off-topic responses. Ops practices enable embedding metadata logging, timestamping, and collection versioning, allowing teams to audit which embeddings were used for a specific query, when they were generated, and whether they align with the latest content. This traceability is essential for debugging, compliance, and continual improvement of the retrieval layer.
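The embedding traceability described above can be made concrete with a small record builder. This is an illustrative sketch, not an API of any particular vector store; the field names (`content_hash`, `collection_version`, and so on) are assumptions you would adapt to your own metadata schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def embedding_record(doc_id: str, text: str, model_name: str, collection_version: str) -> dict:
    """Build a traceability record for one embedded document.

    Field names are illustrative; adapt them to your vector store's
    metadata schema.
    """
    return {
        "doc_id": doc_id,
        # Hashing the source text lets you detect stale vectors later.
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "embedding_model": model_name,
        "collection_version": collection_version,
        "embedded_at": datetime.now(timezone.utc).isoformat(),
    }

record = embedding_record("doc-42", "Some document text.", "text-embedding-3-small", "v7")
print(json.dumps(record, indent=2))
```

Attaching such a record to every vector at write time is what makes the audits described above ("which embeddings served this query, and when were they generated?") possible later.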

因此,如果没有 LLMOps 和 RAGOps,即使是最具创新性的 GenAI 应用也面临大规模失败的风险。运维确保可靠性、可观测性、治理和持续改进,将原型转化为值得信赖、生产级的解决方案,并持续创造价值。

So, without LLMOps and RAGOps, even the most innovative GenAI applications risk failure at scale. Ops ensures reliability, observability, governance, and continuous improvement, transforming a prototype into a trustworthy, production-grade solution that consistently delivers value.

比较LLM和RAG评估

Comparing LLM and RAG evaluations

在开发 RAG 系统时,必须理解 LLM 评估和 RAG 评估之间的区别,因为两者针对的是整个流程的不同组成部分。虽然两者都旨在评估质量、相关性和性能,但它们侧重于系统的不同阶段,并且需要不同的技术和指标。

When developing a RAG system, it is essential to understand the distinction between LLM evaluation and RAG evaluation, as each targets different components of the overall pipeline. While both aim to assess quality, relevance, and performance, they focus on different stages of the system and require different techniques and metrics.

LLM评估

LLM evaluation

语言模型评估是指评估语言模型在给定输入提示的情况下生成准确、流畅且符合语境的响应的能力。该评估通常在模型选择、微调或验证阶段进行,主要关注以下方面:

LLM evaluation refers to assessing the language model’s ability to generate accurate, fluent, and contextually appropriate responses, given an input prompt. This evaluation is typically performed during model selection, fine-tuning, or validation stages and focuses on:

  • 流畅性和语法:输出结果是否语法正确,是否像人类的表达方式?
  • Fluency and grammar: Are the outputs grammatically correct and human-like?
  • 连贯性:生成的文本是否遵循逻辑结构?
  • Coherence: Does the generated text follow a logical structure?
  • 事实准确性:输出结果是基于事实,还是臆想出来的?
  • Factual accuracy: Is the output grounded in truth, or is it hallucinated?
  • 相关性:LLM是否恰当地回应了提示的意图?
  • Relevance: Does the LLM respond appropriately to the intent of the prompt?

常用的评估方法包括:

Common evaluation methods include:

  • 双语评估参考文本 (BLEU) :衡量生成的输出与参考文本之间的 n-gram 精确度重叠,常用于机器翻译。
  • Bilingual Evaluation Understudy (BLEU): Measures n-gram precision overlap between the generated output and a reference text, commonly used in machine translation.
  • 面向回忆的概要评估辅助工具(ROUGE):通过比较 n-gram 重叠或最长公共子序列来关注回忆,常用于摘要任务。
  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE): Focuses on recall by comparing n-gram overlaps or longest common subsequences, often used in summarization tasks.
  • 用于评估具有明确排序的翻译的指标 (METEOR) :通过结合词干提取、同义词匹配和词序灵活性,在 BLEU 的基础上进行了改进,平衡了精确率和召回率。
  • Metric for Evaluation of Translation with Explicit ORdering (METEOR): Improves on BLEU by incorporating stemming, synonym matching, and word order flexibility, balancing precision and recall.
  • 来自 Transformers 的双向编码器表示得分 (BERTScore) :使用上下文相关的 BERT 嵌入来计算生成文本和参考文本之间的语义相似性,从而更好地与人类判断保持一致。
  • Bidirectional Encoder Representations from Transformers Score (BERTScore): Uses contextualized BERT embeddings to compute semantic similarity between generated and reference texts, offering better alignment with human judgment.
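As a teaching aid, the clipped n-gram precision at the heart of BLEU can be computed in a few lines of plain Python. This sketch omits the brevity penalty and the geometric mean over n-gram orders that full BLEU applies; for real evaluations, rely on an established implementation such as sacrebleu.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision, the core quantity behind BLEU."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a word cannot inflate the score.
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

# 5 of the 6 candidate unigrams appear in the reference -> 5/6
print(ngram_precision("the cat sat on the mat", "the cat is on the mat", 1))
```

ROUGE inverts the direction of the comparison (recall against the reference), and BERTScore replaces exact n-gram matching with embedding similarity, but all three share this overlap-counting core.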

其他方法包括人工评估质量等级和基于提示的单元测试,以评估推理、总结或幻觉倾向。

Other methods include human evaluation for quality ratings and prompt-based unit tests to assess reasoning, summarization, or hallucination tendencies.

LLM 评估对于了解模型在没有外部检索上下文的情况下独立运行的性能至关重要。

LLM evaluation is crucial for understanding how the model performs in isolation, without the external retrieved context.

RAG 评估

RAG evaluation

相比之下,RAG 评估侧重于完整的检索+生成流程,衡量系统检索相关文档并利用这些文档生成基于上下文的答案的效率。它包含以下几个层级:

RAG evaluation, by contrast, focuses on the complete retrieval + generation pipeline, measuring how effectively the system retrieves relevant documents and uses them to generate grounded, context-aware answers. It involves several layers, which are as follows:

  • 检索质量
    • 召回率@k/精确率@k :衡量检索到的前 k 个文档中有多少是相关的。
    • 嵌入漂移跟踪:它监控当前嵌入对不断发展的知识库的表示程度。
    • 覆盖面和多样性:它决定检索到的数据集是否提供了充分且多样化的事实依据。
  • Retrieval quality:
    • Recall@k/Precision@k: It measures how many of the top-k retrieved documents are relevant.
    • Embedding drift tracking: It monitors how well the current embeddings represent the evolving knowledge base.
    • Coverage and diversity: It determines if the retrieved set provides sufficient and diverse factual grounding.
  • 生成的接地性
    • 忠实于源文件:确保生成的输出与检索到的文档一致。
    • 上下文使用情况:它评估 LLM 将检索到的内容融入响应的程度。
  • Generation groundedness:
    • Faithfulness to source: It ensures the generated output aligns with retrieved documents.
    • Context usage: It evaluates how well the LLM incorporates retrieved content into responses.
  • 管道级指标
    • 精确匹配(EM)和F1分数:常用于QA类型的任务。
    • 上下文感知 BERTScore :它通过将比较建立在检索到的文档之上来扩展 BERTScore。
    • 基于 LLM 的评分者:他们使用单独的模型,根据检索到的上下文来评估事实一致性和连贯性。
  • Pipeline-level metrics:
    • Exact match (EM) and F1 score: It is often used for QA-style tasks.
    • Context-aware BERTScore: It extends BERTScore by grounding the comparison in the retrieved documents.
    • LLM-based graders: They use separate models to evaluate factual consistency and coherence based on retrieved context.
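Recall@k and precision@k are simple enough to sketch directly. The following minimal function scores a single query; in practice you would average these values over a benchmark set of queries with known relevant documents.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and Recall@k for a single query.

    retrieved_ids: ranked document ids returned by the retriever.
    relevant_ids: the gold set of relevant ids for the query.
    """
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the top-3 results are relevant, and 2 of the 3 relevant docs were found.
p, r = precision_recall_at_k(["d1", "d9", "d3", "d7"], {"d1", "d3", "d5"}, k=3)
print(p, r)
```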

RAG评估还强调日志记录和可追溯性,追踪检索到的文档、使用的嵌入版本以及提示信息的生成方式。这有助于对系统故障进行根本原因分析,并实现持续改进。

RAG evaluation also emphasizes logging and traceability, tracking which documents were retrieved, which embedding version was used, and how prompts were formed. This enables root cause analysis of system failures and continual improvement.

区分的重要性

Importance of distinction

LLM 评估有助于判断模型的独立性能,而 RAG 评估则评估整个流程在实际应用中的表现。一个模型单独使用时可能生成完美的答案,但如果与糟糕的检索相结合,则可能失效。反之,即使检索效果很好,但如果搭配不匹配的LLM,也可能导致幻觉。因此,必须单独和联合地进行这两种评估,以确保RAG系统可靠且达到生产级标准。

While LLM evaluation helps you judge the standalone capabilities of your model, RAG evaluation assesses how well the entire pipeline performs in practice. A model might generate perfect answers in isolation but fail when paired with poor retrieval. Conversely, great retrieval with a misaligned LLM could lead to hallucinations. Therefore, both types of evaluation must be conducted independently and jointly to ensure a reliable, production-grade RAG system.

评估是GenAI运维的核心

Evaluation as the core of GenAI Ops

在生产级 GenAI 系统中,尤其是在 RAG 架构中,评估不仅仅是一项开发活动,更是运维(LLMOps 和 RAGOps)的关键组成部分。这些评估是监控质量、诊断故障、维护系统完整性和实现持续改进的基础。

In production-grade GenAI systems, especially RAG architectures, evaluation is not just a development activity; it is a critical component of Ops (LLMOps and RAGOps). These evaluations serve as the foundation for monitoring quality, diagnosing failures, maintaining system integrity, and enabling continuous improvement.

确保大规模输出质量

Ensuring output quality at scale

在生产环境中,用户期望很高;每一条响应都必须切题、流畅且有理有据。评估使您能够使用自动化指标(例如 BLEU、ROUGE 和 BERTScore)和 HITL 反馈系统来量化质量。这些评估对于建立质量基线、定义可接受的性能阈值以及检测性能随时间推移而发生的下降至关重要。

In production, user expectations are high; every response must be relevant, fluent, and grounded. Evaluations enable you to quantify quality using both automated metrics (like BLEU, ROUGE, and BERTScore) and HITL feedback systems. These evaluations are essential for establishing quality baselines, defining acceptable performance thresholds, and detecting degradation over time.

例如,如果在模型更新后,实时 A/B 测试中的 BERTScore 或 METEOR 指标下降,运维团队可以触发回滚或将流量路由到更稳定的版本。这种持续的评估循环确保模型更新不会悄无声息地降低用户体验。

For example, if BERTScore or METEOR drops in real-time A/B tests after a model update, Ops teams can trigger rollbacks or route traffic to a more stable version. This continuous evaluation loop ensures that model updates do not silently degrade UX.

监测漂移和幻觉

Monitoring drift and hallucinations

LLMOps 必须考虑概念漂移,即模型性能因用户行为、词汇或上下文变化而下降。评估有助于及早发现这种漂移。例如,幻觉率的上升(可通过忠实度指标或基于LLM的验证器来衡量)可能表明检索到的文档已过时、不相关或与用户查询不符。

LLMOps must account for concept drift, where the model’s performance decays due to changing user behavior, vocabulary, or context. Evaluations help detect this drift early. For instance, a rise in hallucination rates, measured through faithfulness metrics or LLM-based verifiers, can indicate that retrieved documents are outdated, irrelevant, or misaligned with user queries.

通过不断评估生成结果的可靠性,RAGOps 系统可以跟踪输出结果何时偏离检索到的文档,并触发自动索引刷新、嵌入重新生成或重新训练计划。

By continuously evaluating generation groundedness, RAGOps systems can track when outputs deviate from retrieved documents and trigger automatic index refresh, embedding re-generation, or retraining schedules.
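A minimal drift check along these lines compares the rolling mean of a quality metric (here, a groundedness score per response) against the recorded baseline and raises a flag when the gap exceeds a tolerance. The window and tolerance values below are illustrative, not recommendations.

```python
from statistics import mean

def drift_alert(recent_scores, baseline, tolerance=0.05):
    """Flag drift when the rolling mean of a quality metric falls
    more than `tolerance` below the recorded baseline.

    A flagged result would feed the triggers described above:
    index refresh, embedding re-generation, or retraining.
    """
    current = mean(recent_scores)
    return (baseline - current) > tolerance, current

# Groundedness has slid from a 0.85 baseline; this should alert.
alerted, current = drift_alert([0.81, 0.78, 0.74, 0.70], baseline=0.85)
print(alerted, current)
```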

评估检索质量以进行预先调试

Evaluating retrieval quality for preemptive debugging

RAG 系统严重依赖向量存储和知识库。即使 LLM 运行正常,检索质量差通常也是输出结果不佳的根本原因。诸如 Recall@k、嵌入相似度得分和覆盖率得分等评估指标可以实时反映检索效果。

RAG systems rely heavily on vector stores and knowledge bases. Poor retrieval quality is often the root cause of bad outputs, even if the LLM is functioning correctly. Evaluation metrics like Recall@k, Embedding Similarity Score, and Coverage Score provide real-time insight into retrieval effectiveness.

通过可视化这些指标的运维仪表盘,团队可以识别低召回率查询、不相关的文档命中或新内容的冷启动问题。此类评估有助于调优检索器、调整提示工程,或在无需重新训练LLM的情况下重建嵌入索引。

Operational dashboards that visualize these metrics allow teams to identify low-recall queries, irrelevant document hits, or cold-start issues with new content. Such evaluations enable retriever tuning, prompt engineering adjustments, or embedding index regeneration without needing to retrain the LLM.

支持版本控制和可追溯性

Supporting version control and traceability

LLM 和 RAG 评估均支持运维层面的可追溯性。在复杂的 GenAI 系统中,能够追踪哪个版本的检索器、嵌入模型或 LLM 生成了特定答案,对于合规性、审计和调试至关重要。评估日志可作为结构化证据,证明特定管道版本在部署前已满足所需的性能标准。

Both LLM and RAG evaluations support Ops-level traceability. In complex GenAI systems, being able to track which version of a retriever, embedding model, or LLM produced a specific answer is critical for compliance, audits, and debugging. Evaluation logs act as structured evidence that a given pipeline version met required performance standards before deployment.

这些评估也可以用于持续集成和持续部署(CI/CD)管道,其中测试时 BLEU、ROUGE 或答案 F1 的任何下降都会阻止生产部署,直到问题得到解决。

These evaluations can also be used in continuous integration and continuous deployment (CI/CD) pipelines, where any drop in test-time BLEU, ROUGE, or answer F1 blocks production deployment until the issue is resolved.
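Such a CI/CD gate can be as simple as comparing a candidate build's evaluation scores against minimum thresholds and refusing deployment on any miss. The metric names and threshold values below are hypothetical placeholders for whatever your pipeline tracks.

```python
def passes_quality_gate(metrics: dict, thresholds: dict) -> bool:
    """Return True only if every tracked metric meets its minimum.

    A CI/CD pipeline would block deployment when this returns False.
    A metric missing from `metrics` counts as a failure.
    """
    return all(metrics.get(name, 0.0) >= minimum for name, minimum in thresholds.items())

thresholds = {"rouge_l": 0.40, "answer_f1": 0.70, "recall_at_5": 0.80}
candidate = {"rouge_l": 0.43, "answer_f1": 0.68, "recall_at_5": 0.83}
print(passes_quality_gate(candidate, thresholds))  # False: answer_f1 regressed
```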

反馈回路和自愈系统

Feedback loops and self-healing systems

高级 GenAI Ops 融合了反馈感知重训练基于人类反馈的强化学习( RLHF )。评估指标为这些反馈循环提供信号,使系统能够从用户评分、点击量或更正中学习。

Advanced GenAI Ops incorporates feedback-aware retraining and Reinforcement Learning from Human Feedback (RLHF). Evaluation metrics provide the signal for these feedback loops, enabling the system to learn from user ratings, click-throughs, or corrections.

例如,如果用户对某个答案的评价很低,评估系统可以将其与检索到的文档进行比较,并标记出这是检索问题还是生成问题。这种针对性的洞察支持精细化的优化,而不仅仅是通用的重新训练。

For instance, if a user rates an answer poorly, evaluations can compare it against retrieved documents and flag whether it is a retrieval issue or a generation issue. This targeted insight supports fine-grained optimization, not just generic retraining.
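One crude way to perform this retrieval-versus-generation triage is token overlap: if the retrieved documents barely relate to the query, suspect retrieval; if they relate but the answer ignores them, suspect generation. The threshold and the overlap measure below are stand-ins for embedding similarity or an LLM grader in a real system.

```python
def triage_failure(query, retrieved_docs, answer, threshold=0.2):
    """Classify a poorly rated response as a retrieval or generation issue.

    Token overlap is a deliberately crude proxy; production systems
    would use embedding similarity or an LLM-based grader instead.
    """
    def overlap(a, b):
        a_set, b_set = set(a.lower().split()), set(b.lower().split())
        return len(a_set & b_set) / max(len(a_set), 1)

    doc_text = " ".join(retrieved_docs)
    if overlap(query, doc_text) < threshold:
        return "retrieval issue"   # documents do not match the query
    if overlap(answer, doc_text) < threshold:
        return "generation issue"  # answer ignores the documents
    return "ok"

print(triage_failure(
    "refund policy for annual plans",
    ["annual plans can be refunded within 30 days of purchase"],
    "our shipping is fast",
))  # generation issue
```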

在 GenAI 运维中,评估指标是可观测性工具;它们能够实时展现系统行为、检测故障、指导回滚、提供重新训练信息,并实现智能自动化。如果没有强大的 LLM 和 RAG 评估,运维团队实际上就像盲人摸象,只能被动地应对用户投诉,而无法主动确保系统的可靠性和可信度。

In GenAI Ops, evaluation metrics are observability tools; they expose system behavior in real-time, detect faults, guide rollbacks, inform retraining, and enable intelligent automation. Without robust LLM and RAG evaluation, Ops teams are effectively blind, reacting to user complaints instead of proactively ensuring system reliability and trustworthiness.

RAGOps

RAGOps

RAGOps 指的是应用于 RAG 系统整个生命周期的运维实践、工具和监控策略。它涵盖了关键组件的评估、跟踪和优化,例如嵌入生成、文档检索、重排序、提示构建和语言模型输出。RAGOps 确保系统在开发和生产环境中保持准确性、可靠性和高性能。通过集成可观测性、版本控制、反馈循环和自动化评估,RAGOps 使团队能够检测偏差、减少误判并确保与用户意图保持一致。最终,RAGOps 对于构建基于检索增强架构的可扩展、可信赖且持续改进的 GenAI 应用至关重要。

RAGOps refers to the operational practices, tools, and monitoring strategies applied to RAG systems across their lifecycle. It encompasses the evaluation, tracking, and optimization of key components such as embedding generation, document retrieval, reranking, prompt construction, and language model outputs. RAGOps ensures that systems remain accurate, grounded, and performant in both development and production environments. By integrating observability, versioning, feedback loops, and automated evaluation, RAGOps enables teams to detect drift, reduce hallucinations, and maintain alignment with user intent. Ultimately, RAGOps is essential for building scalable, trustworthy, and continuously improving GenAI applications based on retrieval-enhanced architectures.

在 RAG 系统的开发和后期开发阶段,RAGOps 在确保质量、可靠性和可追溯性方面发挥着至关重要的作用。在开发阶段,它能够通过指标和可观测性工具,对嵌入、检索精度、提示构建和生成接地性进行系统评估。后期开发阶段,RAGOps 的重点转向生产环境中的监控、漂移检测、实时故障跟踪和反馈集成。通过在整个生命周期中应用 RAGOps 实践,团队可以主动解决问题、执行质量基准并支持持续改进,从而将 RAG 系统从实验原型转变为可扩展、可靠的解决方案,并可随时部署到实际环境中。

During both the development and post-development phases of a RAG system, RAGOps plays a vital role in ensuring quality, reliability, and traceability. During development, it enables systematic evaluation of embeddings, retrieval accuracy, prompt construction, and generation groundedness through metrics and observability tools. Post-development, RAGOps shifts focus on monitoring, drift detection, real-time failure tracking, and feedback integration in production environments. By applying RAGOps practices throughout the lifecycle, teams can proactively address issues, enforce quality benchmarks, and support continuous improvement, transforming RAG systems from experimental prototypes into scalable, dependable solutions ready for real-world deployment.

在开发过程中

During development

以下列表概述了 RAGOps 在开发过程中的目标、实践和目的:

The following list outlines the objectives, practices, and goals of RAGOps during development:

  • 目标:构建一个稳健、可测试且高质量的 RAG 流水线。
  • Objective: Build a robust, testable, and high-quality RAG pipeline.
  • RAGOps 实践
    • 嵌入质量分析和版本控制。
    • 使用合成查询或黄金标准查询进行检索精确率/召回率测试。
    • 提示格式验证和令牌使用日志记录。
    • 生成输出中的接地感和幻觉检测。
    • 迭代重排序器调优和分阶段评估。
    • Ragas LangfuseArize Phoenix等工具集成,进行跟踪级评估。
  • RAGOps practices:
    • Embedding quality analysis and version control.
    • Retrieval precision/recall testing with synthetic or gold-standard queries.
    • Prompt formatting validation and token usage logging.
    • Groundedness and hallucination detection in generated outputs.
    • Iterative reranker tuning and stage-wise evaluation.
    • Integration with tools like Ragas, Langfuse, or Arize Phoenix for trace-level evaluation.
  • 目标:在部署之前建立强大的可观测性、可追溯性和指标基线。
  • Goal: Establish strong observability, traceability, and metric baselines before deploying.

由于RAG系统具有多组件、非确定性和模块化的特性,在开发过程中对RAGOps进行识别和基准测试本身就非常复杂。要在开发过程中实现稳健的可观测性和评估,需要采用结构化的方法。

Identifying and benchmarking RAGOps during development is inherently complex due to the multi-component, non-deterministic, and modular nature of RAG systems. Achieving robust observability and evaluation during development requires a structured approach.

开发过程中 RAGOps 的识别

Identification in RAGOps during development

识别是 RAGOps 的基础阶段。它专注于发现 RAG 流程中可能发生故障的关键点,并确定究竟需要跟踪哪些内容,以确保质量、可靠性和可追溯性。

Identification is the foundational phase of RAGOps. It focuses on discovering the key points in the RAG pipeline where failures may occur and establishing what exactly needs to be tracked to ensure quality, reliability, and traceability.

首先,必须将 RAG 系统分解为各个组成部分:嵌入创建、检索、可选重排序、提示构建和生成。对于每个阶段,开发人员都必须识别可能出现的问题以及这些问题的具体表现形式。

To begin with, the RAG system must be decomposed into its constituent components: embedding creation, retrieval, optional reranking, prompt construction, and generation. For each of these stages, developers must identify what could go wrong and how those failures would manifest.

例如,在词嵌入创建过程中,低质量的向量表示会导致检索结果不佳。因此,监控词嵌入的偏差、确保文档覆盖率以及跟踪索引更新的及时性至关重要。开发人员还应检查词嵌入是否反映了内容的当前状态,或者是否使用了过时的向量。

For instance, during embedding creation, low-quality vector representations can lead to poor retrieval results. Therefore, it is important to monitor embedding drift, ensure complete document coverage, and track the timeliness of index updates. Developers should also examine whether embeddings reflect the current state of the content or if stale vectors are in use.

检索阶段经常因语义不匹配或前k个结果排名不合理而失败。要识别这些问题,需要跟踪检索到的文档与用户查询的相关性。这涉及到检查检索到的文档与真实值或预期值之间的重叠情况。

The retrieval stage often fails due to semantic mismatch or inadequate top-k ranking. Identifying issues here requires tracking how often retrieved documents are relevant to user queries. This involves examining the overlap between retrieved documents and ground truth or expectations.

在包含重排序层的系统中,会引入额外的复杂性。故障可能包括对真正相关的文档进行错误排序,或者在重复运行中出现不稳定。这就需要跟踪重排序对检索顺序的影响程度,以及它是否能改善下游的生成。

In systems with reranking layers, additional complexity is introduced. Failures may include misranking of truly relevant documents or instability across repeated runs. This requires tracking how much reranking changes the retrieval order and whether it improves downstream generation.

提示构建是另一个需要格外注意的步骤,格式错误或过长的提示会导致语言模型接收到的输入被截断或错位。识别此类问题需要监控提示模板的一致性、词元长度和格式错误。

Prompt construction is another sensitive step, where malformed or overlong prompts can lead to truncated or misaligned inputs to the language model. Identifying such issues requires monitoring prompt template consistency, token length, and formatting errors.
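A lightweight pre-flight validator can catch overlong or malformed prompts before they reach the model. This sketch uses whitespace tokenization as a rough proxy for the model's real tokenizer and assumes `{context}`/`{question}`-style template placeholders; adapt both assumptions to your own templates and tokenizer.

```python
def validate_prompt(prompt: str, max_tokens: int = 4096):
    """Cheap pre-flight check on a rendered prompt.

    Whitespace splitting only approximates token count; production
    code would use the model's actual tokenizer.
    """
    token_count = len(prompt.split())
    issues = []
    if token_count > max_tokens:
        issues.append(f"prompt exceeds budget: {token_count} > {max_tokens}")
    # Placeholders surviving into the final prompt indicate a broken template.
    if "{context}" in prompt or "{question}" in prompt:
        issues.append("unfilled template placeholder detected")
    return token_count, issues

count, issues = validate_prompt("Answer using {context} only.", max_tokens=10)
print(count, issues)
```

Logging `token_count` alongside the prompt-template version gives you exactly the consistency and truncation-rate signals described above.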

最后,在生成阶段,幻觉和不连贯的输出是最常见的问题。开发人员必须确定模型是否忠实地使用了检索到的内容,并避免生成虚假信息。这需要检查生成的输出与源文档之间的一致性。

Finally, in the generation phase, hallucinations and incoherent outputs are the most common issues. Developers must identify whether the model faithfully uses the retrieved content and avoids producing fabricated information. This entails inspecting the alignment between the generated output and the source documents.

因此,识别是一个诊断过程。它通过揭示 RAG 管道的哪些部分脆弱以及哪些指标或信号指示这些脆弱性,为可观测性奠定了基础。

Identification is, therefore, a diagnostic process. It sets the stage for observability by exposing which parts of the RAG pipeline are fragile and which metrics or signals are indicative of those fragilities.

在 RAGOps 开发过程中进行基准测试

Benchmarking in RAGOps during development

一旦确定了关键跟踪点,下一步就是进行基准测试,为每个组件定义量化基线和质量标准。

Once critical tracking points are identified, the next step is benchmarking, defining quantitative baselines and quality standards for each component.

基准测试始于创建黄金标准评估数据集。由于在早期开发阶段无法获得真实用户数据或真实用户数据不具代表性,因此该数据集通常由人工筛选或自动生成的查询组成,每个查询都与已知的相关文档和预期输出相匹配。这种受控设置使得系统性能能够以一致且可重复的方式进行评估。

Benchmarking begins with the creation of a gold-standard evaluation dataset. Since live user data is unavailable or unrepresentative during early development, this dataset is typically composed of manually curated or synthetically generated queries, each paired with known relevant documents and expected outputs. This controlled setup allows the system’s performance to be measured in a consistent, repeatable manner.

接下来,对于 RAG 流程的每个阶段,开发者必须定义相应的评估指标。例如,检索阶段使用 Recall@k 和 precision@k 等指标来评估成功检索到的相关文档数量。对于模型生成阶段,则使用 BERTScore、忠实度评分和幻觉率等指标来评估语义正确性和扎根性。

Next, for each stage of the RAG pipeline, developers must define the appropriate evaluation metrics. For example, the retrieval stage is evaluated using metrics such as Recall@k and precision@k to assess how many relevant documents are successfully retrieved. For generation, metrics such as BERTScore, faithfulness score, and hallucination rate are used to assess semantic correctness and groundedness.

系统使用基准数据集进行测试后,所得分数将被记录为基线值。这些分数作为参考点,用于比较后续迭代流程的性能,以判断是否存在退步或改进。基准测试不仅仅是收集数据,更重要的是确定可接受的标准。这涉及到定义容差阈值。例如,要求检索召回率不低于某个特定值,或者幻觉发生率保持在设定的最大值以下。

Once the system is tested against the benchmark dataset, the resulting scores are recorded as baseline values. These scores serve as reference points, allowing future iterations of the pipeline to be compared for regression or improvement. Benchmarking is not just about collecting numbers but about establishing what is acceptable. This involves defining tolerance thresholds. For example, requiring that retrieval recall does not fall below a certain value or that hallucination rates remain under a defined maximum.

至关重要的是,基准测试必须与版本控制相结合。每个基准测试结果都必须与嵌入模型、向量索引、提示模板或重排序器的特定版本相关联。这确保了观察到的性能变化可以追溯到流程中的特定修改。

Crucially, benchmarking must also be tied to version control. Each benchmark result must be associated with a specific version of the embedding model, vector index, prompt template, or reranker. This ensures that observed changes in performance can be traced back to specific modifications in the pipeline.
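Tying benchmark scores to component versions can be as simple as serializing both together in a single record. The keys below (`embedding_model`, `vector_index`, `prompt_template`) are illustrative; use whatever identifies the moving parts of your pipeline.

```python
import json

def benchmark_result(scores: dict, versions: dict) -> str:
    """Serialize a benchmark run together with the exact component
    versions that produced it, so any later regression can be traced
    back to a specific modification.
    """
    return json.dumps({"scores": scores, "versions": versions}, sort_keys=True)

run = benchmark_result(
    scores={"recall_at_5": 0.82, "faithfulness": 0.91},
    versions={"embedding_model": "v3", "vector_index": "2026-01-10", "prompt_template": "qa-v12"},
)
print(run)
```

Experiment trackers such as MLflow provide this pairing of metrics and run metadata out of the box; the sketch simply makes the underlying idea explicit.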

基准测试的结束不仅在于记录基线分数,更在于将其整合到开发工作流程中。这可以采取测试期间的手动检查清单或 CI/CD 流水线中的自动化关卡等形式。其目标是确保 RAG 流水线的每个组件在投入生产之前都符合最低性能标准。

Benchmarking concludes not when baseline scores have merely been recorded, but when they are integrated into the development workflow. This can take the form of manual checklists during testing or automated gates in a CI/CD pipeline. The objective is to ensure that every component of the RAG pipeline adheres to minimum performance standards before being considered ready for production.

识别提供了可观测性框架,明确了需要观察的内容以及可能出现问题的地方,而基准测试则设定了量化参考标准,以及系统必须达到的良好程度才能满足运行标准。它们共同构成了 RAGOps 在开发过程中的核心,确保系统稳健、可解释,并能够在运行约束下不断演进。现在,让我们重点关注开发后的阶段。

Identification provides the observability structure, what to watch, and where issues might arise, while benchmarking sets the quantitative reference and how good the system must be to meet operational standards. Together, they form the core of RAGOps during development, ensuring that the system is robust, interpretable, and ready to evolve under operational constraints. Now, let us focus on the post-development phase.

下表 18.1(RAGOps 开发过程中的跟踪)概述了需要跟踪的内容、可能出现故障的位置,以及在 RAG 开发流程的每个阶段可以使用的指标或工具。本指南旨在帮助您在部署前将可观测性和评估集成到开发工作流程中。

The following Table 18.1, RAGOps tracking during development, outlines what to track, where failures may arise, and which metrics or tools can be used at each stage of the RAG development pipeline. This serves as a practical guide to integrate observability and evaluation into your development workflow before deployment.

阶段

Stage

关键故障点

Key failure points

追踪什么

What to track

指标/工具

Metrics/Tools

数据摄取和嵌入创建

Data ingestion and embedding creation

嵌入内容质量低劣或已过时,文档缺失,以及格式问题。

Low-quality or outdated embeddings, missing documents, and format issues.

嵌入质量、文档数量、格式有效性和嵌入漂移。

Embedding quality, document count, format validity, and embedding drift.

嵌入相似度、覆盖率%、LangChain 日志、Facebook AI 相似性搜索(Faiss)索引统计信息。

Embedding similarity, coverage %, LangChain logs, Facebook AI Similarity Search (Faiss) index stats.

检索

Retrieval

无关的前 k 个结果、检索延迟、回忆效果差。

Irrelevant top-k results, retrieval latency, poor recall.

召回率@k、精确率@k、查询延迟、检索文档重叠率。

Recall@k, precision@k, query latency, retrieved document overlap.

召回@k,查询时间,Langfuse/Arize Phoenix 日志。

Recall@k, query time, Langfuse/Arize Phoenix logs.

重新排名(如果使用)

Reranking (if used)

排名错误、评分嘈杂、上下文不匹配。

Incorrect ranking, noisy scoring, context mismatch.

得分差异性、前1相关性和排名稳定性。

Score divergence, top-1 relevance, and rank stability.

重排序得分方差、相关性指标和评估轨迹。

Reranker score variance, correlation metrics, and evaluation traces.

提示构建

Prompt construction

提示信息过长、格式错误、词法单元被截断。

Overlong prompts, incorrect formatting, token cutoff.

提示长度、提示模板一致性、截断率。

Prompt length, prompt-template consistency, truncation rate.

令牌长度日志、提示模板版本。

Token length logs, prompt-template versions.

生成

Generation

出现幻觉、语无伦次、无视语境。

Hallucinations, incoherence, context ignoring.

接地感、流畅性、幻觉发生率、LLM 日志。

Groundedness, fluency, hallucination rate, LLM logs.

BERTScore、幻觉检查器、WhyLabs、Langfuse 跟踪记录。

BERTScore, hallucination checker, WhyLabs, Langfuse traces.

评估

Evaluation

主观质量问题,缺乏反馈机制。

Subjective quality issues, no feedback loop.

人性化评价、真实性、产出忠实度。

Human ratings, groundedness, output faithfulness.

BLEU、ROUGE、BERTScore、Ragas、人工注释日志。

BLEU, ROUGE, BERTScore, Ragas, human annotation logs.

表 18.1:RAGOps 在开发过程中的跟踪

Table 18.1: RAGOps tracking during development

后期开发

Post-development

以下列表重点介绍了 RAGOps 在后期开发阶段的目标、实践和目的:

The following list highlights the objectives, practices, and goals of RAGOps during post-development:

  • 目标:在用户负载下保持可靠性、追踪错误并确保实时质量。
  • Objective: Maintain reliability, trace errors, and ensure real-time quality under user load.
  • RAGOps 实践
    • 持续监测检索和生成性能
    • 嵌入和知识库内容中的漂移检测
    • 实时提示和输出跟踪,用于故障诊断
    • 代理系统或混合系统中日志记录工具/工具链的使用情况
    • 现场幻觉和接地检查
    • 反馈循环收集(用户评分、点击率等)
  • RAGOps practices:
    • Continuous monitoring of retrieval and generation performance
    • Drift detection in embeddings and knowledge base content
    • Real-time prompt and output tracing for failure diagnosis
    • Logging tool/toolchain usage in agentic or hybrid systems
    • Live hallucination and grounding checks
    • Feedback loop collection (user ratings, click-through, etc.)
  • 目标:通过可观察性和自动化反馈循环实现韧性、稳定性和适应性。
  • Goal: Achieve resilience, stability, and adaptability through observability and automated feedback loops.

开发后的识别

Identification post-development

在实践中,首先要确定要跟踪哪些指标,然后根据这些跟踪指标设定基准。因此,流程如下:

In practice, you first identify what to track, and then you set benchmarks based on those tracked metrics. So, the sequence is as follows:

1. 确定要跟踪的内容(即,定义与系统目标一致的关键指标)。

1. Identify what to track (i.e., define key metrics aligned with your system's goals).

2. 建立基准(即,为这些指标建立基准值和可接受的阈值)。

2. Establish benchmarks (i.e., establish baseline values and acceptable thresholds for those metrics).

表 18.2(RAG 系统故障跟踪表)结构化地总结了不同 RAG 系统类型的故障点、识别策略、指标和跟踪方法。您可以在系统评估和部署期间将其用作诊断和操作参考。

Table 18.2, RAG system failure tracking table, presents a structured summary of failure points, identification strategies, metrics, and tracking methods across different RAG system types. You can use it as a diagnostic and operational reference during system evaluation and deployment.

RAG系统

RAG system

关键故障点

Key failure points

识别策略

Identification strategy

跟踪方法

Tracking method

指标

Metrics

单级 RAG

Single-stage RAG

低质量的嵌入、不相关的检索、虚假的生成。

Low-quality embeddings, irrelevant retrieval, hallucinated generation.

Recall@k、BERTScore、幻觉分析。

Recall@k, BERTScore, hallucination analysis.

嵌入日志、检索重叠、接地检查。

Embedding logs, retrieval overlap, grounding checks.

召回率@k、精确率@k、BERTScore、幻觉率。

Recall@k, precision@k, BERTScore, hallucination rate.

两阶段 RAG

Two-stage RAG

初始检索效果差、重排序效果差、上下文不匹配。

Weak initial retrieval, poor reranking, context mismatch.

首次回忆与重新排序回忆、重新排序得分分析。

First vs. reranked recall, reranker score analysis.

中间文档日志,重排序器元数据。

Intermediate document logs, reranker metadata.

Recall@k(重排序前/后)、重排序器得分分布、忠实度。

Recall@k (pre/post rerank), reranker score distribution, faithfulness.

多阶段 RAG

Multi-stage RAG

错误传播、过度过滤、重排序器冲突。

Error propagation, excessive filtering, reranker conflict.

分阶段消融,整体存在分歧。

Stage-wise ablation, ensemble disagreement.

阶段日志、重排序器版本控制。

Stage logs, reranker versioning.

分阶段回忆、整体一致性得分和上下文利用。

Stage-wise recall, ensemble agreement score, and context utilization.

多模态 RAG

Multimodal RAG

模态错位、融合不良、输出无依据。

Modality misalignment, poor fusion, ungrounded outputs.

跨模态相似性、注意力图分析。

Cross-modal similarity, attention map analysis.

模态特定日志、融合轨迹和漂移监测。

Modality-specific logs, fusion trace, and drift monitoring.

CLIP 相似度、VQAScore、跨模态 BERTScore、图像标题 BLEU。

CLIP similarity, VQAScore, cross-modal BERTScore, image caption BLEU.

RAG 中的传统工具

Traditional tool in RAG

工具误用、误解和 API 故障。

Tool misuse, misinterpretation, and API failure.

行动与观察不匹配,模式验证。

Action-observation mismatch, schema validation.

工具调用日志,提示版本控制。

Tool call logs, prompt versioning.

工具调用准确率、模式匹配率、工具错误率。

Tool invocation accuracy, schema match rate, tool error rate.

代理 RAG

Agentic RAG

计划循环、无效的工具链和目标不一致。

Planning loops, invalid toolchains, and goal misalignment.

痕迹一致性,链有效性检查。

Trace coherence, chain validity checks.

完整的跟踪日志,工具错误跟踪。

Full trace logs, tool error tracking.

代理计划有效性、行动观察一致性、步骤准确性。

Agent plan validity, action-observation alignment, step accuracy.

基于图的RAG

Graph-based RAG

图稀疏/不相关,遍历错误。

Sparse/irrelevant graph, traversal errors.

图指标,节点相关性评分。

Graph metrics, node relevance scoring.

遍历日志,边权重跟踪。

Traversal logs, edge weight tracking.

图覆盖率、节点中心性、边相关性得分。

Graph coverage, node centrality, edge relevance score.

文本到 SQL RAG

Text-to-SQL RAG

架构错误、SQL无效、执行失败。

Wrong schema, invalid SQL, execution failure.

SQL语法验证、执行测试。

SQL syntax validation, execution testing.

模式日志、查询结果比较。

Schema logs, query result comparison.

SQL 有效性、执行准确率、模式对齐得分。

SQL validity rate, execution accuracy, schema alignment score.

基于OCR的RAG

OCR-based RAG

OCR识别不准确,布局分类错误。

OCR inaccuracy, layout misclassification.

OCR置信度,文本-视觉比较。

OCR confidence, text-visual comparison.

OCR日志、检索准确性审核。

OCR logs, retrieval accuracy audits.

OCR置信度评分、文本提取准确率和检索精度。

OCR confidence score, text extraction accuracy, and retrieval precision.

表 18.2:RAG 系统故障跟踪表

Table 18.2: RAG system failure tracking table

RAG系统开发后的基准测试

Benchmarking in RAG systems post-development

在成功开发并初步部署 RAG 系统之后,工作重点转向在实际运行环境中维护系统的可靠性、质量和运行连续性。虽然实时监控、日志记录和用户反馈在生产环境中发挥着重要作用,但基准测试仍然是 RAGOps 框架内一项基本的开发后实践。开发后基准测试确保系统行为始终符合其最初目标,检测出潜在的退化问题,并支持可追溯的质量保证。具体细节如下:

Following the successful development and initial deployment of a RAG system, the focus shifts toward maintaining reliability, quality, and operational continuity in a live environment. While real-time monitoring, logging, and user feedback play an important role in production, benchmarking remains a fundamental post-development practice within the broader RAGOps framework. Post-development benchmarking ensures that system behavior remains aligned with its original objectives, detects silent regressions, and supports traceable quality assurance. The details are as follows:

  • 开发后基准测试的目的和作用:在开发后阶段,基准测试发挥着双重作用:它既是回归安全网,也是基准性能验证器。与实时监控(捕获实时指标和系统状态)不同,基准测试使用固定数据集进行受控评估。这使得系统性能能够以可重复、可解释和标准化的方式进行评估。

    这种区别至关重要。在动态的生产环境中——查询分布不断变化、索引不断更新、外部系统波动——基准测试提供了一个稳定、不变的参考标准,可以用来衡量系统性的变化。如果没有这种控制,团队只能通过嘈杂、无标签且不断变化的实时数据来解读模型性能,这使得找出性能下降的根本原因变得困难。

  • The purpose and role of benchmarking post-development: In post-development settings, benchmarking serves a dual role: it functions as a regression safety net and as a baseline performance validator. Unlike live monitoring, which captures real-time metrics and system state, benchmarking offers a controlled evaluation using a fixed dataset. This allows system performance to be assessed in a repeatable, interpretable, and standardized way.

    This distinction is critical. In a dynamic production environment—where query distributions evolve, indices are updated, and external systems fluctuate—benchmarking offers a stable, invariant reference against which systemic changes can be measured. Without this control, teams are left to interpret model performance through noisy, unlabeled, and ever-changing live data, making it difficult to isolate the causes of performance degradation.

  • 黄金标准数据集的持续重要性:即使在生产环境中,由代表性查询、精心整理的答案和经过验证的相关文档组成的黄金标准数据集仍然至关重要。这些数据集是评估检索器、重排序器、嵌入模型或生成组件更新性能的基础。

    黄金标准数据能够实现以下目标:

    • 对系统变化进行可重复评估
    • 不同模型版本之间的比较
    • 可靠地计算诸如召回率@k、忠实度和幻觉率等指标

      这些数据集的结构是固定的,因此可以对性能进行纵向跟踪。此外,企业还可以逐步添加从真实用户数据中提取的高质量、经人工验证的示例,从而实现一种混合基准测试方法,该方法能够随着生产需求的演变而发展,同时又不牺牲可靠性。

  • Continued relevance of gold-standard datasets: Even in production, gold-standard datasets, composed of representative queries, curated answers, and validated relevant documents, remain essential. These datasets serve as the foundational substrate for performance evaluation across updates to retrievers, rerankers, embedding models, or generation components.

    Gold-standard data enables the following:

    • Repeatable evaluations of system changes
    • Comparison across model versions
    • Reliable computation of metrics such as recall@k, faithfulness, and hallucination rate

      These datasets are frozen in structure, allowing performance to be tracked longitudinally. Moreover, organizations can incrementally augment them with high-quality, human-verified examples derived from live user data—thus enabling a hybrid benchmarking approach that evolves with production needs without sacrificing reliability.
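As a sketch of how a frozen gold-standard set supports repeatable recall@k computation, consider the following; the query and document IDs are purely illustrative:

```python
# Hypothetical gold-standard benchmark: each entry pairs a query with the IDs
# of documents verified as relevant. recall@k is the fraction of relevant
# documents that appear in the top-k retrieved results, averaged over queries.
def recall_at_k(gold, retrieved, k):
    scores = []
    for query, relevant in gold.items():
        top_k = set(retrieved.get(query, [])[:k])
        scores.append(len(top_k & set(relevant)) / len(relevant))
    return sum(scores) / len(scores)

# Toy frozen dataset (invented IDs, not from a real corpus)
gold = {"q1": ["d1", "d2"], "q2": ["d5"]}
retrieved = {"q1": ["d1", "d9", "d2"], "q2": ["d3", "d5", "d7"]}

print(recall_at_k(gold, retrieved, 2))  # q1 finds 1 of 2, q2 finds 1 of 1 -> 0.75
```

Because `gold` never changes, the same call can be repeated after every retriever or embedding update, making score movements attributable to the system rather than to shifting data.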

  • 基准测试作为运营保障的推动因素:RAGOps 的主要目标之一是确保系统性能符合预定义的服务级别目标(SLO)。基准测试是验证这些保障的机制。例如,如果一个组织承诺将幻觉率维持在 7% 以下,或将 recall@5 维持在至少 70%,那么这些阈值必须通过系统性的基准测试来验证,而不能仅仅根据实时流量推断。

    在生产工作流程中,这通常表现为定期基准测试评估(例如,CI/CD 流水线中的夜间运行或部署前检查)。性能指标基于基准数据集计算,并与历史基线进行比较,如果超出可接受的容差范围,则会发出警告。

  • Benchmarks as enablers of operational guarantees: One of the principal goals of RAGOps is to ensure that system performance adheres to predefined service-level objectives (SLOs). Benchmarks act as the mechanism for validating these guarantees. For example, if an organization commits to maintaining a hallucination rate below 7% or a minimum recall@5 of 70%, these thresholds must be verified through systematic benchmarking, not inferred solely from live traffic.

    In production workflows, this often takes the form of scheduled benchmark evaluations (e.g., nightly runs or pre-deployment checks in a CI/CD pipeline). Performance metrics are computed on the benchmark dataset, compared to historical baselines, and flagged if they fall outside acceptable tolerances.
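A minimal sketch of such a pre-deployment benchmark gate, reusing the SLO thresholds from the example above (the metric names and nightly values are hypothetical):

```python
# Compare freshly computed benchmark metrics against predefined SLOs and
# return every violation, e.g. to fail a CI/CD stage before deployment.
def check_slos(metrics, slos):
    """`slos` maps metric name -> ("max"|"min", threshold)."""
    violations = []
    for name, (op, threshold) in slos.items():
        value = metrics[name]
        ok = value <= threshold if op == "max" else value >= threshold
        if not ok:
            violations.append((name, value, threshold))
    return violations

slos = {"hallucination_rate": ("max", 0.07), "recall_at_5": ("min", 0.70)}
nightly = {"hallucination_rate": 0.09, "recall_at_5": 0.74}
print(check_slos(nightly, slos))  # hallucination rate exceeds its 7% ceiling
```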

  • 基准测试作为合规性、可追溯性和调试工具:除了性能监控之外,开发后的基准测试还能提供额外的运营优势。在受监管或高风险领域(例如医疗保健、金融、法律科技),基准测试支持以下方面:
    • 可审计性:证明系统行为在特定时间点符合已验证的质量标准
    • 可追溯性:将模型输出与特定系统版本和配置关联起来
    • 根本原因分析:通过将实际行为与基准预期进行比较,调试运行系统中的故障。

    在诊断性能下降时,这一点尤为重要,因为基准测试是暴露于实时用户和数据变化中的系统中唯一不变的基准。

  • Benchmarking as a tool for compliance, traceability, and debugging: Beyond performance monitoring, post-development benchmarks offer additional operational advantages. In regulated or high-stakes domains (e.g., healthcare, finance, legal tech), benchmarking supports the following:
    • Auditability: Demonstrating that system behavior adhered to validated quality standards at a given point in time
    • Traceability: Linking model outputs to specific system versions and configurations
    • Root cause analysis: Debugging failures in live systems by comparing behavior against benchmarked expectations

    This is especially critical when diagnosing performance drops, as benchmarks provide the only invariant baseline in a system exposed to real-time user and data variability.

  • 生产环境中基准测试流程的演进:虽然为了保持一致性,黄金标准数据集必须保持不变,但开发后的阶段也能从自适应基准测试策略中获益。这些策略包括:
    • 定期对实时查询进行人工标注,以扩展基准语料库
    • 影子评估,即在不影响用户界面输出的情况下,并行测试新系统变体在基准查询上的性能。
    • 基准测试切片,即将基准测试的子集与特定的用户群体、查询类型或领域进行匹配,以便进行更精细的诊断。
    即使系统及其运行环境不断发展,这些做法也能使基准测试保持相关性和响应性。
  • Evolving the benchmarking process in production: While gold-standard datasets must remain static for consistency, the post-development phase also benefits from adaptive benchmarking strategies. These include:
    • Periodic human annotation of live queries to expand the benchmark corpus
    • Shadow evaluations, where new system variants are tested on benchmark queries in parallel without affecting user-facing output
    • Benchmark slicing, where subsets of the benchmark are aligned to specific user segments, query types, or domains for more granular diagnostics
    These practices allow benchmarking to remain relevant and responsive, even as the system and its operating environment continue to evolve.

在开发后阶段进行基准测试不仅可行,而且是 RAGOps 的关键支柱。它提供了在复杂多变的生产环境中监控系统健康状况所需的客观性、稳定性和可解释性。通过将静态的黄金标准数据集与实时系统洞察相结合,组织可以确保 RAG 系统保持可靠性、可解释性并与运营目标保持一致。因此,基准测试发挥着持续验证机制的作用。

Benchmarking during the post-development phase is not only feasible, it is a critical pillar of RAGOps. It provides the objectivity, stability, and interpretability needed to monitor system health in complex and volatile production settings. By combining static, gold-standard datasets with live system insights, organizations can ensure that RAG systems remain reliable, explainable, and aligned with operational goals. Benchmarking thus acts as the continuous validation mechanism.

持续监测

Continuous monitoring

基准测试虽然为验证 RAG 系统性能是否符合已知标准提供了一个稳定且可重复的框架,但其本质上是静态且周期性的。相比之下,持续监控能够实时运行,使系统利益相关者能够观察、评估并响应生产环境中出现的性能变化。持续监控是开发后 RAG 运维的重要组成部分,有助于提高运行可靠性、用户信任度、系统弹性以及基于反馈的改进。

While benchmarking provides a stable and repeatable framework for validating RAG system performance against known standards, it is inherently static and periodic. In contrast, continuous monitoring operates in real-time, enabling system stakeholders to observe, evaluate, and respond to performance variations as they unfold in production environments. Continuous monitoring is an essential component of post-development RAGOps, facilitating operational reliability, user trust, system resilience, and feedback-driven improvement.

实时 RAG 系统中的持续监控

Continuous monitoring in live RAG systems

生产中的 RAG 系统处于动态且往往不可预测的环境中,详情如下:

RAG systems in production are exposed to a dynamic and often unpredictable environment, details as follows:

  • 随着用户行为的演变,查询分布也会发生变化。
  • Query distributions change as user behavior evolves.
  • 知识库或向量库得到更新或扩展。
  • Knowledge bases or vector stores are updated or extended.
  • 检索和生成性能可能会随时间推移而下降。
  • Retrieval and generation performance may drift over time.
  • 上游数据源、API 或检索器可能出现不一致的情况。
  • Upstream data sources, APIs, or retrievers may become inconsistent.

在这种情况下,仅仅依靠周期性基准测试是不够的。持续监控能够提供实时可观测性,确保偏差、故障或退化能够及早被发现和诊断,通常在用户察觉之前就能解决。

In such settings, relying solely on periodic benchmarks is insufficient. Continuous monitoring provides real-time observability, ensuring that deviations, failures, or regressions are detected and diagnosed early, often before they become user-visible.

RAGOps 中需要监控的关键指标

Key metrics to monitor in RAGOps

RAG 系统中有效的持续监测必须能够捕获不同管道组件的一系列指标。这些指标包括以下几项:

Effective continuous monitoring in RAG systems must capture a range of metrics across different pipeline components. These include the following:

  • 检索级别监控
    • 召回率近似值(通过点击率或代理标签)
    • 检索延迟和响应时间
    • 前k个文档相似性和多样性
    • 嵌入漂移检测(例如,随时间变化的余弦相似度)
  • Retrieval-level monitoring:
    • Recall approximation (via click-through rates or proxy labels)
    • Retrieval latency and response time
    • Top-k document similarity and diversity
    • Embedding drift detection (e.g., cosine similarity over time)
  • 生成级监控
    • 幻觉风险指标(例如,接地信心低)
    • 提示词计数和截断率
    • 反应的连贯性、流畅性和长度分布
    • 故障模式:输出为空、重复或无关
  • Generation-level monitoring:
    • Hallucination risk indicators (e.g., low grounding confidence)
    • Prompt token count and truncation rates
    • Response coherence, fluency, and length distribution
    • Failure modes: empty, repetitive, or irrelevant outputs
  • 系统级监控
    • 端到端延迟(检索+生成)
    • 模型版本和提示模板使用情况日志记录
    • 查询量、失败率和用户交互指标
    • 工具执行失败(针对智能体或工具增强型 RAG)
  • System-level monitoring:
    • End-to-end latency (retrieval + generation)
    • Model version and prompt template usage logging
    • Query volume, failure rates, and user interaction metrics
    • Tool execution failures (for agentic or tool-augmented RAG)

这些指标提供了基础设施健康状况和LLM质量两个维度的运营可见性。

These metrics provide operational visibility across both infrastructure health and LLM quality dimensions.
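The embedding drift check listed above can be approximated by comparing the centroid of recent query embeddings against a reference centroid saved during benchmarking. A dependency-free sketch, where the 2-D vectors and the 0.9 alert threshold are illustrative assumptions:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroid(vectors):
    # Component-wise mean of a batch of embeddings.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Reference window (benchmarking period) vs. today's live queries.
reference = centroid([[1.0, 0.0], [0.9, 0.1]])
today = centroid([[0.1, 1.0], [0.0, 0.9]])
drifted = cosine(reference, today) < 0.9  # alert when similarity drops
print(drifted)
```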

连续监测的技术和工具

Techniques and tools for continuous monitoring

RAG 系统中用于实施持续监控的工具和方法多种多样,具体如下:

A wide range of tools and methodologies are used to implement continuous monitoring in RAG systems, which are as follows:

  • 可观测性平台:Langfuse、Arize Phoenix 和 WhyLabs 等工具提供端到端的跟踪、提示日志记录和评估仪表板。
  • Observability platforms: Tools like Langfuse, Arize Phoenix, and WhyLabs provide end-to-end tracing, prompt-logging, and evaluation dashboards.
  • 日志记录和跟踪:详细的日志记录查询输入、检索到的文档、提示和最终输出,对于事后分析至关重要。
  • Logging and tracing: Detailed logs capturing query inputs, retrieved documents, prompts, and final outputs are essential for post-hoc analysis.
  • 自定义评估器:可以部署基于 LLM 或基于规则的评分器,实时对输出的忠实性、合理性或连贯性进行评分。
  • Custom evaluators: LLM-based or rule-based graders can be deployed to score outputs for faithfulness, groundedness, or coherence in real-time.
  • 漂移检测模型:向量漂移和标记分布监控可以检测嵌入或提示结构何时开始偏离学习的规范。
  • Drift detection models: Vector drift and token distribution monitoring can detect when embeddings or prompt structures start to deviate from learned norms.

将这些系统集成到生产流程中,可以实现实时反馈循环、警报机制和回滚策略。

Integration of these systems into the production pipeline enables real-time feedback loops, alerting mechanisms, and rollback strategies.

警报、仪表盘和异常检测

Alerting, dashboards, and anomaly detection

成熟的监控系统包含基于阈值的警报功能,当关键指标超出预定义的可接受范围(在基准测试期间设定)时,该系统会通知相关人员。例如:

A mature monitoring system includes threshold-based alerting, which notifies stakeholders when key metrics fall outside predefined acceptable ranges (as set during benchmarking). For instance:

  • 平均幻觉风险评分飙升
  • A spike in the average hallucination risk score
  • 检索召回率代理指标突然下降
  • A sudden drop in the retrieval recall proxy
  • LLM 延迟或提示截断频率意外增加
  • An unexpected increase in LLM latency or prompt truncation frequency

实时仪表盘可持续显示此类指标,并通过时间序列可视化和跟踪比较实现根本原因分析。

Real-time dashboards provide continuous visibility into such metrics and enable root cause analysis through time-series visualizations and trace comparisons.
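Such a threshold-based alert can be sketched as a rolling-window check over a live metric stream; the window size, tolerance, and score values below are illustrative, not recommendations:

```python
from collections import deque

# Keep a rolling window of a live metric (e.g., hallucination risk score)
# and fire when the window mean exceeds a tolerance set during benchmarking.
class MetricAlert:
    def __init__(self, window, limit):
        self.values = deque(maxlen=window)
        self.limit = limit

    def observe(self, value):
        self.values.append(value)
        mean = sum(self.values) / len(self.values)
        return mean > self.limit  # True -> notify stakeholders

alert = MetricAlert(window=3, limit=0.07)
signals = [alert.observe(v) for v in [0.05, 0.06, 0.08, 0.12]]
print(signals)  # only the last observation pushes the window mean over 0.07
```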

反馈回路和自愈系统

Feedback loop and self-healing systems

持续监测不仅是被动的,更是构建自愈和自适应系统的基础。当与主动学习循环、用户反馈或强化信号相结合时,监测到的输出可以提供以下信息:

Continuous monitoring is not only reactive; it is the foundation for building self-healing and adaptive systems. When paired with active learning loops, user feedback, or reinforcement signals, monitored outputs can inform:

  • 对重排序器或检索器的动态再训练
  • Dynamic retraining of rerankers or retrievers
  • 对检索到的上下文进行重新加权
  • Re-weighting of retrieved contexts
  • 为过期内容重新生成嵌入。
  • Re-generation of embeddings for stale content
  • 更新提示或模型选择策略
  • Updates to prompts or model selection strategies

持续监控是开发后 RAG 运维中不可或缺的一部分。它确保已部署的系统保持运行质量,快速响应变化,并随着时间的推移不断适应变化。通过捕获检索、生成和基础设施层面的实时信号,并将这些信号转化为可执行的洞察,监控将 RAG 系统从静态部署转变为动态的、可学习的、持续运行的应用,使其在动态的生产环境中保持稳健、可靠,并始终与其预期用途保持一致。

Continuous monitoring is a non-negotiable aspect of post-development RAGOps. It ensures that deployed systems maintain operational quality, respond rapidly to changes, and adapt over time. By capturing real-time signals across retrieval, generation, and infrastructure layers, and translating these signals into actionable insights, monitoring transforms RAG systems from static deployments into living, learning applications that remain robust, trustworthy, and aligned with their intended purpose in dynamic production environments.

可观测性平台

Observability platforms

要大规模部署 RAG 系统,可观测性是可靠性的基石。目前,丰富的平台生态系统提供追踪、评估、漂移监控和接地诊断等功能,每个平台在 RAGOps 系统中都扮演着独特的角色。以下是一些塑造这一领域的核心工具。

To operationalize RAG systems at scale, observability is the backbone of reliability. A rich ecosystem of platforms now provides tracing, evaluation, drift monitoring, and grounding diagnostics, each filling a distinct role in the RAGOps stack. Below are some of the core tools shaping this space.

核心可观测性平台

Core observability platforms

Langfuse、Arize Phoenix、WhyLabs 和 MLflow 等基础平台为 RAGOps 提供追踪、评估、漂移监控和提示/版本管理功能。这些是提供全栈可见性的骨干系统。详情如下:

Foundational platforms like Langfuse, Arize Phoenix, WhyLabs, and MLflow that provide tracing, evaluation, drift monitoring, and prompt/version management for RAGOps. These are the backbone systems that give full-stack visibility. The details are as follows:

  • Langfuse :一款功能强大的开源 LLM/RAG 可观测性套件,提供完整的跟踪日志记录、提示管理、提示级延迟/成本指标和评估分析。它可与 Ragas 等工具集成,并支持OpenTelemetry ( OTEL ) 插桩。
  • Langfuse: A powerful open-source LLM/RAG observability suite offering full trace logging, prompt management, prompt-level latency/cost metrics, and evaluation analytics. It integrates with tools like Ragas and supports OpenTelemetry (OTEL) instrumentation.
  • Arize Phoenix :一个专注于LLM流程追踪、聚类分析和RAG特定诊断的开源平台。虽然在实验和评估方面表现出色,但与Langfuse相比,其提示管理功能不够全面。
  • Arize Phoenix: An open-source platform focused on LLM pipeline tracing, cluster analysis, and RAG-specific diagnostics. While strong in experimentation and evaluation, it lacks comprehensive prompt management compared to Langfuse.
  • WhyLabs :一套强大的漂移监控和可观测性工具集,提供 RAG 特有的功能,例如检索一致性跟踪、接地指标以及对幻觉、PII 或提示注入的安全监控。
  • WhyLabs: A robust drift-monitoring and observability toolset that offers RAG-specific capabilities like retrieval consistency tracking, grounding metrics, and security monitoring for hallucinations, PII, or prompt injections.
  • MLflow :MLflow 为 GenAI 系统(尤其是 RAG 流水线)提供全面的支持,包括端到端追踪、自动化评估、提示和版本管理以及企业级部署。MLflow 的追踪功能只需一行代码即可捕获 LLM 调用、检索、工具使用情况、延迟和上下文元数据。其评估框架包含 LLM 作为评判者以及启发式指标,用于评估正确性、相关性、幻觉风险和安全性。内置的提示注册表支持版本化的无代码提示工程。MLflow 支持统一的部署和治理,从而实现跨开发和生产的持续质量保证、回滚控制和可追溯的性能监控。
  • MLflow: MLflow offers comprehensive support for GenAI systems, especially RAG pipelines, by providing end-to-end tracing, automated evaluation, prompt and version management, and enterprise-grade deployment. With a single line of instrumentation, MLflow’s tracing captures LLM calls, retrievals, tool usage, latency, and contextual metadata. Its evaluation framework includes LLM-as-judge and heuristic metrics to assess correctness, relevance, hallucination risks, and safety. The built-in prompt registry enables versioned, no-code prompt engineering. MLflow supports unified deployment and governance, enabling continuous quality assurance, rollback control, and traceable performance monitoring across development and production.

RAG专用评估库

RAG-specific evaluation libraries

以下列表概述了诸如 Ragas 和 LlamaIndex 可观测性模块等工具,这些工具专注于合成测试数据生成、基础评估以及为 RAG 流水线提供无缝集成。它们通过 RAG 相关的评估指标来补充核心平台。

The following list outlines tools such as Ragas and LlamaIndex observability module that focus on synthetic test-data generation, grounding evaluation, and seamless instrumentation for RAG pipelines. They complement the core platforms with RAG-focused evaluation metrics.

  • Ragas :一个用于生成合成 RAG 测试数据和进行无参考流水线评估(例如,忠实度、幻觉评分)的开源库。它可与 Langfuse 和 Phoenix 无缝集成。
  • Ragas: An open-source library for synthetic RAG test-data generation and reference-free pipeline evaluation (e.g., faithfulness, hallucination scoring). It seamlessly integrates with Langfuse and Phoenix.
  • LlamaIndex 可观测性模块:为使用 LlamaIndex 构建的 RAG 管道提供内置检测功能,从而实现与 Phoenix 或其他可观测性平台的一键集成。
  • LlamaIndex observability module: Provides built-in instrumentation for RAG pipelines structured with LlamaIndex, enabling one-click integration with Phoenix or other observability platforms.

辅助工具和生态系统集成

Auxiliary tools and ecosystem integrations

OTEL、RAGViz 和 InspectorRAGet 等支持性组件增强了分布式追踪、可视化以及人工与算法混合评估功能。这些组件扩展了可观测性堆栈,以实现更专业的诊断。

Supporting pieces like OTEL, RAGViz, and InspectorRAGet, which enhance distributed tracing, visualization, and hybrid human + algorithmic evaluation. These extend the observability stack for more specialized diagnostics.

  • OTEL 检测:许多工具利用 OTEL 实现 RAG 流水线的分布式追踪。Arize 和 Langfuse 等项目使用 OTEL 的变体进行日志收集和追踪导出。
  • OTEL instrumentation: Many tools leverage OTEL for distributed tracing across RAG pipelines. Projects like Arize and Langfuse use variants for log collection and trace export.
  • RAGViz(研究工具):一个开源诊断工具,可可视化检索上下文中的文档级和词元级注意力——有助于分析基础性能和检索错误。
  • RAGViz (research tool): An open-source diagnostic tool that visualizes document-level and token-level attention over retrieved contexts—helpful in analyzing grounding performance and retrieval errors.
  • InspectorRAGet :一个用于 RAG 评估的自省平台,它结合了人工和算法指标来分析管道级性能、错误案例和评估质量。
  • InspectorRAGet: An introspection platform for RAG evaluation that combines human and algorithmic metrics to analyze pipeline-level performance, error cases, and evaluation quality.

这些工具提供端到端的 RAG 可观测性,从提示级别的追踪和生成评估到接地诊断和漂移检测。您可以组合不同的平台(例如,使用 Langfuse 和 Ragas 进行评估,使用 WhyLabs 进行漂移检测,以及使用 OTEL 进行追踪),构建一个稳健的、生产级的 RAGOps 堆栈,以满足您系统的架构和领域需求。基于此,我们来讨论一个基于图的推荐引擎。以下架构的端到端代码可以在 GitHub 代码库中找到。

These tools provide end-to-end RAG observability, from prompt-level tracing and generation evaluation to grounding diagnostics and drift detection. You can combine platforms (e.g., Langfuse with Ragas for evaluation, WhyLabs for drift, and OTEL for tracing) to construct a robust, production-grade RAGOps stack tailored to your system’s architecture and domain requirements. With this understanding, let us discuss a graph-based recommendation engine. The end-to-end code of the following architecture can be found in the GitHub repository.

有了这种可观测性基础,我们现在可以将重点转移到基于图的推荐引擎,探索这些监控原则如何扩展到智能检索和推荐管道。

With this foundation in observability, we can now shift focus to a graph-based recommendation engine, exploring how these monitoring principles extend into intelligent retrieval and recommendation pipelines.

基于图增强的 RAG 推荐系统

Graph-enhanced RAG-based recommendation system

以下架构展示了一个模块化且可扩展的 RAG 推荐系统流程,该系统集成了结构化产品数据、用户偏好、基于图的关系以及神经排序技术。该系统利用 LangChain 进行流程编排,Faiss 进行向量索引,NetworkX 进行图表示,以及基于 Transformer 的嵌入模型进行语义匹配。

The following architecture represents a modular and extensible RAG pipeline for a recommendation system that integrates structured product data, user preferences, graph-based relationships, and neural ranking techniques. The system leverages LangChain for orchestration, Faiss for vector indexing, NetworkX for graph representation, and transformer-based embedding models for semantic matching.

流程图展示了一个推荐系统流程,该流程使用 Llama Index、文本嵌入、向量数据库、编排器、Ollama 的 Mistral、LangChain 和 XAI 混合搜索,向用户提供重新排序的推荐。

图 18.1 基于图增强的 RAG 推荐架构

Figure 18.1: Graph-enhanced RAG-based recommendation architecture

数据摄取管道

Data ingestion pipeline

设计一个有效的推荐引擎不仅仅是简单的检索,它需要一个多阶段的流程,将语义搜索、基于图的推理和个性化整合起来。以下架构概述了完整的流程,从将原始数据转换为嵌入和结构化图,到协调混合检索、重排序和自然语言生成,最终生成面向用户的推荐内容:

Designing an effective recommendation engine requires more than simple retrieval, it demands a multi-stage pipeline that unifies semantic search, graph-based reasoning, and personalization. The following architecture outlines the complete flow, from transforming raw data into embeddings and structured graphs, to orchestrating hybrid retrieval, reranking, and natural language generation for user-facing recommendations:

  • 转换为文本格式:来自产品目录和历史用户偏好日志这两个数据源的表格数据被转换为适合嵌入的文本表示形式。这使其能够与基于语言模型的向量编码器兼容。
  • Conversion to textual format: Tabular data from two sources, the product catalogue and historical user preference logs, is transformed into textual representations suitable for embedding. This enables compatibility with language model-based vector encoders.
  • 文本分块:将文本数据分割成语义上有意义的块。这一步骤对于捕捉局部上下文和提高检索粒度至关重要。
  • Text chunking: The textual data is segmented into semantically meaningful chunks. This step is critical for capturing localized context and improving retrieval granularity.
  • 嵌入生成:该系统使用 Ollama-all-miniLM-L6-v2 模型计算目录条目和用户偏好的稠密向量嵌入。这些嵌入作为检索操作中语义相似性的基础。
  • Embedding generation: The system uses the Ollama-all-miniLM-L6-v2 model to compute dense vector embeddings for both catalogue entries and user preferences. These embeddings serve as the basis for semantic similarity in retrieval operations.
  • 图构建:使用 NetworkX 从目录数据生成结构化图。该图捕获产品之间的关系,例如相似性、类别层次结构或共现模式,并作为向量搜索的补充检索方式。
  • Graph construction: A structured graph is generated from the catalogue data using NetworkX. This graph captures relationships between products, such as similarity, category hierarchies, or co-occurrence patterns, and serves as a complementary retrieval modality alongside vector search.
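The chunking and graph-construction steps above can be sketched as follows. To keep the example dependency-free, a plain dict stands in for NetworkX (in the real pipeline, `nx.Graph().add_edge(a, b)` would record the same relationships), and the toy catalogue is invented for illustration:

```python
# Split text into fixed-size word chunks for embedding (a simple stand-in
# for semantic chunking).
def chunk(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Link products that share a category: one simple co-occurrence signal a
# catalogue graph might encode.
def build_graph(catalogue):
    graph = {}
    for name, cat in catalogue:
        graph.setdefault(name, set())
        for other, other_cat in catalogue:
            if other != name and other_cat == cat:
                graph[name].add(other)
    return graph

catalogue = [("laptop", "electronics"), ("mouse", "electronics"), ("mug", "kitchen")]
print(chunk("wireless mouse with ergonomic grip", 3))
print(build_graph(catalogue)["laptop"])  # laptop is linked to mouse
```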

检索和推荐流程

Retrieval and recommendation pipeline

此阶段包括查询处理、混合搜索、结果排名和自然语言响应生成,详情如下:

This phase involves query handling, Hybrid search, result ranking, and natural language response generation, details as follows:

  • 查询处理和嵌入:接收用户查询并使用相同的 Ollama-all-miniLM-L6-v2 模型进行嵌入,以保持向量空间与索引文档的一致性。
  • Query processing and embedding: A user query is received and embedded using the same Ollama-all-miniLM-L6-v2 model to maintain vector space consistency with the indexed documents.
  • LangChain 基于代理的编排:LangChain 代理可编排跨三个不同工具的混合检索:
    • 使用 Faiss 进行向量搜索。
    • 使用基于 NetworkX 的图进行图搜索。
    • 结合两种检索信号的混合搜索。

    该代理还可以访问索引嵌入和图结构,以进行实时决策。

  • LangChain agent-based orchestration: A LangChain agent orchestrates hybrid retrieval across three distinct tools:
    • Vector search using Faiss.
    • Graph search using the NetworkX-based graph.
    • Hybrid search that combines both retrieval signals.

    This agent also accesses indexed embeddings and graph structures for real-time decision-making.

  • 检索与个性化:代理基于语义相似性、图遍历和用户偏好匹配度检索前 k 个候选结果。用户偏好被转换为嵌入,并包含在评分过程中,以支持个性化排名。
  • Retrieval and personalization: The agent retrieves top-k candidates based on semantic similarity, graph traversal, and user preference alignment. User preferences are converted to embeddings that are included in the scoring process to support personalized ranking.
  • 使用交叉编码器进行重排序:使用 ms-marco-MiniLM-L6-v2 交叉编码器对初始检索到的候选结果进行重排序。该模型基于查询条目和候选条目之间的成对比较,执行细粒度的相关性评分。
  • Reranking with cross-encoder: The initially retrieved candidates are reranked using the ms-marco-MiniLM-L6-v2 cross-encoder. This model performs fine-grained relevance scoring based on pairwise comparisons between the query and candidate entries.
  • 自然语言包装:最终重新排序的结果通过 Ollama 平台传递给 LLM,以生成流畅、易于理解的响应,从而对推荐内容进行上下文解释。
  • Natural language wrapping: The final, reranked results are passed to an LLM (via the Ollama platform) to produce a fluent, human-readable response that contextualizes and explains the recommendations.
  • 响应交付:系统向用户返回最终推荐结果,其中包含语义、结构和个人相关性,所有内容均以自然语言呈现,以提高用户参与度。
  • Response delivery: The system returns a finalized recommendation to the user, incorporating semantic, structural, and personal relevance, all wrapped in natural language for improved user engagement.

关键技术包括:

Key technologies include:

  • LangChain :工具编排和基于代理的推理。
  • LangChain: Tool orchestration and agent-based reasoning.
  • Faiss :用于高效近似最近邻(ANN)搜索的向量搜索引擎。
  • Faiss: Vector search engine for efficient approximate nearest neighbor (ANN) search.
  • NetworkX :面向结构感知检索的图构建和遍历。
  • NetworkX: Graph construction and traversal for structure-aware retrieval.
  • Ollama-all-miniLM-L6-v2 :用于嵌入的句子转换器模型。
  • Ollama-all-miniLM-L6-v2: Sentence Transformer model for embeddings.
  • ms-marco-MiniLM-L6-v2 :用于重排序的交叉编码器模型。
  • ms-marco-MiniLM-L6-v2: Cross-encoder model for reranking.
  • Ollama LLM :自然语言生成推荐输出。
  • Ollama LLM: Natural language generation of recommendation output.

该系统展示了一种稳健的 RAG 架构,并结合了基于图的推理和用户偏好建模。通过整合多种检索方式(语义、结构和个性化),并利用重排序器和语言模型进行最终输出,该系统确保了推荐结果的高度相关性和可解释性。此外,该设计还支持模块化,使其能够适应各种需要产品推荐或内容检索的领域。

This system exemplifies a robust RAG architecture enhanced with graph-based reasoning and user preference modeling. By integrating multiple retrieval modalities (semantic, structural, and personalized) and leveraging a reranker and language model for final output formulation, it ensures high relevance and interpretability in recommendations. The design also supports modularity, enabling adaptability across various domains where product recommendation or content retrieval is required.

系统中的智能 RAG 设计和多工具检索

Agentic RAG design and multi-tool retrieval in the system

该架构的一个显著特点是其智能体设计,它能够在检索和推荐流程中实现智能决策和动态工具选择。系统并非依赖静态的操作序列,而是将控制权委托给一个基于 LangChain 的智能体,该智能体能够执行推理驱动的检索工作流。这种以智能体为中心的方法为流程引入了灵活性、模块化和可解释性,从而能够根据查询复杂性和用户上下文实现自适应交互。

A distinguishing characteristic of this architecture is its agentic design, which enables intelligent decision-making and dynamic tool selection within the retrieval and recommendation pipeline. Rather than relying on a static sequence of operations, the system delegates control to a LangChain-powered agent that is capable of executing a reasoning-driven retrieval workflow. This agent-centric approach introduces flexibility, modularity, and explainability into the pipeline, facilitating adaptive interactions based on query complexity and user context.

代理控制回路

Agentic control loop

在运行时,代理接收用户查询,并自主决定调用哪些检索工具、调用顺序以及如何组合检索结果。这种规划-执行模式使代理能够:

At runtime, the agent receives the user query and autonomously determines which retrieval tools to invoke, in what order, and how to combine the results. This planning-execution paradigm allows the agent to:

  • 根据输入语义动态选择检索策略。
  • Select retrieval strategies dynamically based on input semantics.
  • 整合并协调来自多个来源的输出结果。
  • Integrate and reconcile outputs from multiple sources.
  • 利用检索到的证据,制定结构化的生成提示。
  • Formulate structured prompts for generation using retrieved evidence.

该代理还会解释中间观察结果(例如,部分检索结果),并可根据条件重新运行工具,从而增强其进行复杂决策的能力。

The agent also interprets intermediate observations (e.g., partial retrieval results) and can conditionally rerun tools, enhancing its capacity for complex decision-making.

三种互补的检索工具

Three complementary retrieval tools

代理控制下的检索系统包含三个专门的工具,每个工具都针对不同的相关性维度,具体如下:

The retrieval system under the agent’s control comprises three specialized tools, each addressing a different dimension of relevance, which are as follows:

  • 向量搜索工具(语义检索):该工具使用 Ollama-all-miniLM-L6-v2 生成的密集向量嵌入,并使用人工神经网络搜索引擎 Faiss 对其进行索引。该工具的目标是检索与用户查询语义相似的文档或目录条目,即使它们使用了不同的术语。它在捕捉概念相似性和释义意图方面尤其有效。
  • Vector search tool (semantic retrieval): This tool uses dense vector embeddings generated via Ollama-all-miniLM-L6-v2 and indexes them with Faiss, an ANN search engine. The goal of this tool is to retrieve documents or catalogue entries that are semantically similar to the user's query, even if they use different terminology. It is particularly effective for capturing conceptual similarity and paraphrased intent.
  • 图搜索工具(结构化检索):该工具基于 NetworkX 构建的图,该图由目录元数据、关系和类别层级构成。它支持结构化遍历,使代理能够识别在逻辑上或关系上与用户查询上下文接近的条目。图搜索在知识图谱中邻近性(例如,共享标签、依赖关系、共现)是相关性强有力信号的领域中尤为重要。
  • Graph search tool (structural retrieval): This tool operates over a NetworkX-based graph, constructed from catalogue metadata, relationships, and category hierarchies. It supports structured traversal, enabling the agent to identify items that are logically or relationally proximate to the user's query context. Graph search is especially valuable in domains where proximity in a knowledge graph (e.g., shared tags, dependencies, co-occurrence) is a strong signal of relevance.
  • 混合搜索工具(聚合检索):混合搜索工具作为元检索器,融合了向量检索和图检索的信号。它可以应用启发式算法、加权评分或排名融合策略,生成整合后的前k个结果集。这种混合方法利用了向量检索的语义丰富性和图检索的结构精确性,从而提高了鲁棒性和覆盖范围。
  • Hybrid search tool (aggregated retrieval): The hybrid search tool serves as a meta-retriever, combining signals from both vector and graph search results. It may apply heuristics, weighted scoring, or rank fusion strategies to produce a consolidated top-k result set. This hybrid approach leverages the semantic richness of vector retrieval and the structural precision of graph-based retrieval, allowing for improved robustness and coverage.
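One concrete rank-fusion strategy such a hybrid tool could apply is reciprocal rank fusion (RRF); a minimal sketch with invented document IDs (k=60 is the conventional RRF constant):

```python
# Reciprocal rank fusion: each source contributes 1/(k + rank) per document,
# so items ranked highly by several retrievers rise to the top.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d1", "d2", "d3"]   # semantic retrieval order
graph_hits = ["d3", "d1", "d4"]    # structural retrieval order
print(rrf([vector_hits, graph_hits]))  # d1 and d3 lead: both sources agree
```

RRF needs no score calibration between sources, which is one reason it is a common default when fusing heterogeneous retrievers.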

代理的运作角色

Operational role of the agent

该智能体将这些工具集成到一个推理循环中,并利用了 LangChain 的 ReAct 式框架。它并非简单地按预定义的顺序执行工具;相反,它:

The agent integrates these tools into a reasoning loop, leveraging LangChain’s ReAct-style framework. It does not simply execute tools in a predefined order; rather, it:

  • 分析查询。
  • Analyzes the query.
  • 决定优先处理语义信号、结构信号还是混合信号。
  • Decides whether to prioritize semantic, structural, or hybrid signals.
  • 从一个或多个工具中检索结果。
  • Retrieves results from one or more tools.
  • 使用学习到的相关性标准对答案进行重新排序和重新表述。
  • Reranks and reformulates responses using learned relevance criteria.

这种智能协调确保系统能够根据查询类型(例如,事实性、关系性、个性化)、领域结构和用户偏好来调整检索策略。

This agentic orchestration ensures that the system can adapt retrieval strategies based on query type (e.g., factual, relational, personalized), domain structure, and user preferences.
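The agent’s tool-selection step can be illustrated with a rule-based stand-in; a real LangChain ReAct agent delegates this choice to an LLM, and the keyword heuristic below is purely illustrative:

```python
# Toy router: pick a retrieval tool from coarse query features. A ReAct
# agent would reason about this in natural language instead.
def select_tool(query):
    relational = {"related", "similar", "category", "bundle"}
    words = set(query.lower().split())
    if words & relational and len(words) > 4:
        return "hybrid_search"   # relational intent in a longer, richer query
    if words & relational:
        return "graph_search"    # short, explicitly relational query
    return "vector_search"       # default: semantic similarity

print(select_tool("noise cancelling headphones"))
print(select_tool("items similar to this laptop"))
```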

表 18.3标题为“代理 RAG 系统的开发后故障点和指标” ,详细列出了潜在故障点以及应在系统的每个主要组件中监控的相应关键指标。这种结构化的方法确保了强大的可观测性,从而支持持续的性能评估和生产环境中的运行可靠性。

Table 18.3, titled, Post-development failure points and metrics for agentic RAG system, provides a detailed breakdown of potential failure points and the corresponding key metrics that should be monitored across each major component of the system. This structured approach ensures robust observability, supporting continuous performance evaluation and operational reliability in production.

运营风险分析和监控指标

Operational risk analysis and monitoring metrics

为确保开发后阶段的运行稳健性和持续性能,必须识别并持续监控代理 RAG 系统各个组件的关键故障点。从嵌入式生成和检索工具到代理编排和输出生成,每个模块都存在独特的风险,这些风险会影响系统的有效性、用户体验和整体可靠性。下表概述了这些关键故障点,并列出了应跟踪的相应指标,以便在生产环境中进行及时诊断、进行明智的优化并确保符合质量标准:

To ensure operational robustness and sustained performance in the post-development phase, it is essential to identify and continuously monitor the key failure points across the various components of the agentic RAG system. Each module, ranging from embedding generation and retrieval tools to agent orchestration and output generation, presents unique risks that can impact system effectiveness, UX, and overall reliability. The following table outlines these critical failure points and specifies the corresponding metrics that should be tracked to enable timely diagnostics, informed optimization, and adherence to quality standards in a production environment:

| Component | Failure points | Key metrics |
| --- | --- | --- |
| Embedding generation | Stale embeddings, low-quality vectors, inconsistent formats | Embedding drift score, coverage ratio, and update frequency |
| Vector search tool | Low semantic recall, retrieval latency, and irrelevant top-k results | Recall@k, precision@k, query latency, and semantic overlap score |
| Graph search tool | Disconnected nodes, sparse traversal paths, and graph mismatch | Node connectivity, average path length, and graph hit rate |
| Hybrid search tool | Inconsistent fusion logic, overfitting to one source, and low diversity | Score agreement rate, result diversity index, and retrieval consistency |
| Agent orchestration | Invalid tool selection, failed execution plans, unhandled exceptions | Tool success rate, plan execution time, and toolchain accuracy |
| Reranking (cross-encoder) | Misranking, latency bottleneck, and unfaithful reordering | Rerank score correlation, latency, top-1 relevance accuracy |
| LLM wrapping for recommendation | Ungrounded generation, hallucination, incoherence | Faithfulness score, hallucination rate, BERTScore |
| End-to-end response quality | Poor personalization, low engagement, and factual inconsistency | User rating score, groundedness rate, and session engagement rate |

Table 18.3: Post-development failure points and metrics for graph-based agentic RAG systems
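
Several of the retrieval metrics in Table 18.3 are straightforward to compute from logged results. The sketch below shows recall@k and precision@k for a single query; the relevant-ID set and the ranked result list are hypothetical stand-ins for what a retriever and its ground truth would provide:

```python
def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of the relevant documents found in the top-k results."""
    hits = len(relevant & set(ranked[:k]))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = len(relevant & set(ranked[:k]))
    return hits / k

relevant = {"m1", "m4", "m7"}            # ground-truth relevant movie IDs
ranked = ["m4", "m2", "m1", "m9", "m5"]  # retriever's top-5 output
print(recall_at_k(relevant, ranked, 5))    # 2 of 3 relevant found
print(precision_at_k(relevant, ranked, 5)) # 2 of 5 results relevant -> 0.4
```

Logged per query over time, these two numbers surface exactly the "low semantic recall" and "irrelevant top-k results" failure points listed for the vector search tool.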

Comparison of various Ops in modern software development

In modern software systems, DevOps, MLOps, and RAGOps serve distinct yet complementary roles that, when integrated, enable scalable, intelligent, and resilient applications. DevOps focuses on the automation of software development and deployment workflows, ensuring consistent integration, delivery, monitoring, and infrastructure management. It lays the foundation for CI/CD pipelines, testing frameworks, and system reliability.

MLOps extends DevOps principles to machine learning (ML) workflows. It enables the operationalization of models through reproducible training pipelines, versioning of datasets and models, automated deployment, and monitoring of model performance over time. MLOps ensures that ML models remain reliable, adaptive, and governed after they are deployed.

RAGOps builds on these foundations to support RAG systems, particularly those combining vector search, retrieval logic, and LLMs. RAGOps introduces new observability and evaluation challenges unique to LLM-based applications, such as monitoring grounding quality, hallucination rates, and retrieval faithfulness. It also addresses traceability across retrieval, reranking, and generation components.

Together, DevOps ensures system stability, MLOps ensures model integrity, and RAGOps ensures prompt-level reasoning traceability and retrieval quality. When orchestrated cohesively, they enable the continuous development, deployment, and refinement of GenAI applications, bridging classical engineering reliability with generative intelligence.

Integrating MLflow with a recommendation system

The following figure illustrates our end-to-end movie-recommendation architecture: a LangGraph-driven retrieval pipeline that turns user queries into Cypher, executes them on Neo4j, and summarises the results with Mistral, while an attached observability layer logs faithfulness and relevance metrics to MLflow for continuous quality monitoring:

A flow diagram shows a recommendation pipeline comprising data ingestion, a Neo4j database, LangGraph, chatbot interaction for user queries, and an observability flow that tracks metrics with MLflow.

Figure 18.2: Observability-enabled recommendation system

Installation of MLflow

In order to instrument retrieval-augmented language model systems with rigorous experiment logging, the first prerequisite is the installation of MLflow, the de facto open-source platform for model tracking.

Install MLflow using pip:

pip install mlflow

It installs the client library that exposes the core programmatic interface, e.g., mlflow.log_param, mlflow.log_metric, and mlflow.start_run, as well as the GenAI-specific extension mlflow.evaluate, which implements quality assessment. These APIs enable researchers to capture every experimental artefact (hyperparameters, retrieval hits, generated answers, evaluation scores) in a reproducible, queryable form.
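
To make the shape of these APIs concrete, the following stand-alone sketch mimics the start_run / log_param / log_metric pattern with a tiny in-memory tracker. The TinyTracker class is a stand-in used here so the example runs without an MLflow installation or server; the real calls are mlflow.start_run(), mlflow.log_param(), and mlflow.log_metric():

```python
from contextlib import contextmanager

class TinyTracker:
    """In-memory stand-in illustrating MLflow's run/param/metric bookkeeping."""

    def __init__(self):
        self.runs = {}       # run_name -> {"params": ..., "metrics": ...}
        self._active = None  # currently open run, if any

    @contextmanager
    def start_run(self, run_name):
        self._active = {"params": {}, "metrics": {}}
        self.runs[run_name] = self._active
        try:
            yield self._active
        finally:
            self._active = None  # close the run, mirroring mlflow.end_run()

    def log_param(self, key, value):
        self._active["params"][key] = value

    def log_metric(self, key, value):
        self._active["metrics"][key] = value

tracker = TinyTracker()
with tracker.start_run("CypherTest_Run1"):
    tracker.log_param("question", "best sci-fi movies")
    tracker.log_metric("faithfulness", 3.0)

print(tracker.runs["CypherTest_Run1"]["metrics"]["faithfulness"])  # 3.0
```

The real MLflow client follows the same pattern, but persists params and metrics to the tracking backend so that they survive the process and appear in the UI.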

To visualise and compare such runs locally, one can launch a lightweight tracking server that persists results to the file system. If the Python Web Server Gateway Interface (WSGI) container Waitress is not already present, it may be added via:

pip install waitress

Subsequently, a single command suffices to expose the MLflow REST and web interface:

waitress-serve --host 127.0.0.1 --port 5000 mlflow.server:app

This spins up the dashboard at http://127.0.0.1:5000, as shown in Figure 18.3, from which investigators can inspect parameter histories, metric trajectories, and artefacts for every run, thereby closing the observability loop for RAG pipelines:

Screenshot of the MLflow interface showing the Experiments tab under the default experiment. No runs have been logged yet, so the main table is empty. The interface uses a dark theme.

Figure 18.3: MLflow screen where metrics are logged

The solution depicted in Figure 18.2 serves solely to illustrate how a graph-based recommendation pipeline, or more broadly, a GenAI workflow, can be instrumented and integrated with MLflow for experiment tracking and observability. Detailed explanations of the underlying components, such as Neo4j, Text2Cypher, and the Relik named entity recognition (NER) model, and their implementation are beyond the scope of this book; they are covered extensively in Learn Python Generative AI, Version 2 by BPB author Indrajit Kar. This chapter focuses exclusively on the observability pipeline with MLflow.

Observability pipeline

In the context of graph-based recommendation systems using LangChain agents and Ollama models, observability and evaluation play a crucial role in ensuring trust, explainability, and system debugging. This study examines two distinct approaches for integrating MLflow into such a pipeline:

  • A custom patch-based tracing approach (MLflow_ollama_patch.py and main_with_m_patch.py).
  • Direct metric-based evaluation via MLflow’s GenAI module (main.py).

The following list outlines the details of the approach:

  • Approach 1—span-based tracing (instrumentation): Uses a custom MLflow_ollama_patch.py module to wrap ollama.chat() calls with MLflow’s low-level tracing API. Each invocation is recorded as a span (inputs, model, outputs), enabling fine-grained observability of intermediate steps, tool usage, and reasoning chains within the pipeline. Best for auditing, debugging, and detailed execution analysis.
  • Approach 2—metric-based evaluation (MLflow GenAI metrics): Leverages MLflow’s first-party GenAI metrics API to evaluate the final outputs of the LLM. Metrics like faithfulness and relevance are computed by an evaluator model (LLM-as-a-judge) and logged as scalar scores. Best for end-to-end quality monitoring in RAG pipelines or summarization systems.

Both approaches are used to instrument an LLM-backed Cypher generator and answer synthesizer, but they differ fundamentally in tracing strategy, complexity, and scope.

Approach 1

This approach explicitly instruments the ollama.chat() function using MLflow’s low-level tracing API. The MLflow_ollama_patch.py file applies a decorator-based patch to Ollama:

  • The MLflow_ollama_patch.py module: span-based tracing:

    The MLflow_ollama_patch.py module serves as a minimal tracing interface that instruments all calls to ollama.chat() using MLflow’s low-level span API. This is achieved through the trace_ollama_chat decorator, which wraps the original function:

    import mlflow
    from mlflow.entities import SpanType
    from functools import wraps

    def trace_ollama_chat(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Open an MLflow span around every chat call and record its I/O.
            with mlflow.start_span(name="ollama.chat",
                                   span_type=SpanType.CHAT_MODEL) as span:
                model = kwargs.get("model", "unknown")
                span.set_inputs({"messages": kwargs.get("messages"), "model": model})
                response = func(*args, **kwargs)
                span.set_outputs(response)
                return response
        return wrapper

    The patch is applied dynamically:

    ollama.chat = trace_ollama_chat(ollama.chat)

This method ensures that each invocation of the LLM is recorded as a span, complete with input prompts, model names, and responses. These spans are visualized in the MLflow UI as part of the execution trace, enabling developers to audit tool use, analyze prompt-response behavior, and diagnose errors at the function level.

In the corresponding main_with_m_patch.py, the user does not need to add extra logic beyond logging parameters and outputs. All LLM calls are automatically traced.
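
The monkey-patching mechanism itself can be demonstrated without MLflow or Ollama. In the sketch below, a stub chat() stands in for ollama.chat, and the wrapper records each call's inputs and outputs into a plain list instead of an MLflow span; both the stub and the recorder are assumptions for illustration:

```python
from functools import wraps

recorded_spans = []  # stand-in for MLflow's span store

def chat(model: str, messages: list) -> dict:
    """Stub standing in for ollama.chat()."""
    return {"message": {"role": "assistant", "content": f"echo:{messages[-1]}"}}

def trace_chat(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        response = func(*args, **kwargs)
        # Record what MLflow would capture as span inputs/outputs.
        recorded_spans.append({
            "inputs": {"model": kwargs.get("model"), "messages": kwargs.get("messages")},
            "outputs": response,
        })
        return response
    return wrapper

chat = trace_chat(chat)  # patch applied dynamically, as in MLflow_ollama_patch.py

chat(model="mistral", messages=["hello"])
print(len(recorded_spans))                   # 1
print(recorded_spans[0]["inputs"]["model"])  # mistral
```

Because the patch replaces the function at the module boundary, every caller is traced transparently, which is exactly why main_with_m_patch.py needs no extra logic.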

The file main_with_m_patch.py integrates a LangChain agent-based QA system with MLflow tracking, while enabling low-level tracing of ollama.chat() calls using the custom module MLflow_ollama_patch. This script offers both semantic quality evaluation of the final answer and execution-level tracing of LLM interactions. The following is a structured breakdown and explanation:

  • User input and schema setup:

    user_question = input("Ask your question: ")

    schema = """(:Movie {title, genre, mood, release_year}) ..."""

    The script captures a free-text query and defines a knowledge graph schema outlining the structure of movie, actor, director, and platform nodes and their relationships.

  • LangGraph agent invocation:

    result = app.invoke(inputs)

    cypher_query = result.get("cypher_query", "")

    The LangGraph agent processes the user question and schema to generate a Cypher query (cypher_query), execute it on a Neo4j graph database (query_results), and generate a natural language answer (final_answer) using the Mistral model via Ollama.

  • MLflow run initialization and logging:

    with mlflow.start_run(run_name="CypherTest_Run1") as run:

    A named MLflow run is started. Inside the try block, the system logs:

    • Params: Question, generated Cypher.
    • Artifacts: Neo4j raw results, final answer text:

      mlflow.log_param("question", user_question)
      mlflow.log_param("cypher_query", cypher_query or "EMPTY")
      mlflow.log_text(json.dumps(query_results, indent=2), "neo4j_context.json")
      mlflow.log_text(final_answer or "EMPTY", "final_answer.txt")

    • Output evaluation using Ollama: The system evaluates both factual consistency and contextual relevance of generated responses. Using Ollama-based evaluators, two metrics are computed:

      faith_score = evaluate_faithfulness_with_ollama(...)
      rel_score = evaluate_relevance_with_ollama(...)
      mlflow.log_metric("faithfulness", faith_score)
      mlflow.log_metric("relevance", rel_score)

      These scores measure whether the final answer reflects database facts (faithfulness) and whether it aligns with the user’s query intent (relevance). To ensure robustness, fallback scores of 0.0 are recorded in case of exceptions, allowing MLflow logs to remain complete and traceable.

      By capturing these metrics within MLflow, the evaluation process becomes a core part of the RAGOps observability pipeline—enabling continuous monitoring, failure diagnosis, and data-driven improvement of both retrieval and generation components.

    • Ollama chat tracing with MLflow_ollama_patch: This line activates patching:

      import MLflow_ollama_patch

      Internally, MLflow_ollama_patch.py wraps ollama.chat() with an MLflow span logger:

      def trace_ollama_chat(func):
          ...  # wraps func in an mlflow span, recording inputs and outputs

      ollama.chat = trace_ollama_chat(ollama.chat)

      This records every ollama.chat() call as a traceable chat model span in MLflow’s UI, capturing:

      • Inputs: System/User messages, model name
      • Outputs: Generated response

      This enables fine-grained observability of prompt evolution and model behavior, which is invisible in standard mlflow.log_*() calls.
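
The fallback-scoring behaviour described above (recording 0.0 when an evaluator raises an exception) can be captured in a small helper. The evaluator below, flaky_judge, is a hypothetical stand-in for the Ollama-based judges; in the actual script the score would then be passed to mlflow.log_metric:

```python
def safe_metric(evaluate, *args, fallback: float = 0.0) -> float:
    """Run an evaluator, falling back to a default score on any exception
    so the experiment log stays complete and traceable."""
    try:
        return float(evaluate(*args))
    except Exception:
        return fallback

def flaky_judge(answer: str) -> float:
    """Hypothetical LLM-as-a-judge that fails on empty input."""
    if not answer:
        raise ValueError("empty answer")  # simulated evaluator failure
    return 3.0

print(safe_metric(flaky_judge, "grounded answer"))  # 3.0
print(safe_metric(flaky_judge, ""))                 # 0.0 (fallback)
```

Wrapping every judge call this way guarantees that each MLflow run carries a value for every metric, even when the evaluator itself misbehaves.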

The following bar charts represent semantic evaluation scores for faithfulness and relevance, both scoring 3.0, indicating moderate alignment between the chatbot’s generated response and the Neo4j query results. These metrics are automatically computed and logged via the MLflow observability pipeline.

A dashboard with two bar charts, titled faithfulness and relevance, both showing a value of 3.00. The interface header shows mlflow and CypherTest_Run1 under the Model metrics tab.

Figure 18.4: Screenshot of MLflow UI displaying model-level metrics for the run named CypherTest_Run1

Approach 2

In contrast, the main.py script leverages MLflow’s first-party GenAI metrics API to assess the quality of final model outputs, rather than tracing intermediate tool usage. The focus is on evaluating the faithfulness and relevance of the final answer generated by Ollama.

For instance:

from mlflow.metrics.genai import faithfulness, relevance

faith_score = faithfulness(model="ollama:/mistral")(
    predictions=[final_answer],
    inputs=[user_question],
    context=[json.dumps(query_results)]
).scores[0]

This call uses a second LLM to evaluate whether the generated answer is consistent with the retrieved knowledge and prompt. The same approach is applied to the relevance() metric. These scalar scores are logged using:

mlflow.log_metric("faithfulness", faith_score)
mlflow.log_metric("relevance", rel_score)

This method does not require patching or custom spans, and it is aligned with end-to-end output validation, making it particularly suitable for RAG pipelines or summarization systems.

The patched tracing approach is optimal when the focus is on tool auditing, intermediate reasoning chain analysis, or span-based ML observability. Meanwhile, the direct mlflow.metrics.genai method is suitable for evaluating the quality of LLM output, especially in RAG systems where answer trustworthiness matters. For a complete pipeline, these approaches can be complementary, using both spans and scores for full-stack GenAI observability.

To illustrate the utility of MLflow in providing semantic evaluation and experiment observability within a graph-based recommendation pipeline, we log and visualize output quality metrics such as faithfulness and relevance. These metrics are computed by an evaluator model (Mistral via Ollama) and automatically recorded for each agentic run. The run names on the left side of the MLflow dashboard look like the following:

  • rogue-whale-186
  • amazing-dolphin-985
  • masked-hound-282
  • traveling-ant-386

They are automatically generated run names by MLflow.

These are human-friendly identifiers for your runs, meant to help you quickly distinguish them in the UI. MLflow assigns them by default if you do not explicitly set a name for a run.

If you would prefer meaningful names (like CypherTest_Run1), you can explicitly name a run in your code:

with mlflow.start_run(run_name="CypherTest_Run1") as run:

A computer screen shows the MLflow Experiments dashboard, with a list of experiment runs on the left and two horizontal bar charts in the main panel comparing model metrics (such as faithfulness and relevance) across multiple runs.

Figure 18.5: MLflow dashboard displaying comparative evaluation of multiple runs using model-level metrics

Each bar in Figure 18.5 corresponds to a unique model run (e.g., awesome-roo-333, masked-hound-282), automatically logged and color-coded for visual differentiation. The faithfulness and relevance charts capture semantic alignment between chatbot responses and Neo4j query outputs via Mistral. This comparative visualization helps developers and researchers assess relative performance across experiments, enabling systematic refinement of graph-based recommendation pipelines.

Troubleshooting MLflow using local filesystem structure

In scenarios where MLflow is configured with a local file-based backend (as opposed to a remote server or SQL store), understanding the directory layout of the mlruns folder is crucial for diagnosing tracking and logging issues. This section outlines the structure and interpretation of MLflow's run-level logs and metadata, and provides practical guidelines for troubleshooting:

  • Structure of the MLflow tracking directory:

    When executing the command:

ls -l mlruns/0

The output displays subdirectories corresponding to individual MLflow runs within the default experiment (experiment ID 0). Each directory name is a universally unique identifier (UUID) representing a specific run. An example listing may appear as:

drwxr-xr-x 7 <Your user name> <Your user name> 224 17 Jul 15:16 068e9e07e75f40efa5c360225157a3ed
drwxr-xr-x 7 <Your user name> <Your user name> 224 17 Jul 16:28 7776a3d0407f45b8b8cd2a92caa4645b
-rw-r--r-- 1 <Your user name> <Your user name> 212 17 Jul 15:16 meta.yaml

Each subdirectory contains the logs, parameters, metrics, and metadata associated with an individual run. The top-level meta.yaml file stores experiment-level information, such as the experiment name and lifecycle status.

  • Internal structure of a specific run directory:

    Inspecting an individual run directory using:

ls -l mlruns/0/7776a3d0407f45b8b8cd2a92caa4645b/

produces a listing such as:

drwxr-xr-x 4 <Your user name> <Your user name> 128 17 Jul 16:28 artifacts
-rw-r--r-- 1 <Your user name> <Your user name> 395 17 Jul 16:28 meta.yaml
drwxr-xr-x 4 <Your user name> <Your user name> 128 17 Jul 16:28 metrics
drwxr-xr-x 4 <Your user name> <Your user name> 128 17 Jul 16:28 params
drwxr-xr-x 6 <Your user name> <Your user name> 192 17 Jul 16:28 tags

Each component serves a distinct purpose.

The following table presents an overview of these core directories and files, summarizing their essential functions in capturing artifacts, logging metrics, storing parameters, and recording critical metadata. This foundational structure ensures a consistent and organized approach for every ML run, making it easier to audit results, compare experiments, and maintain reproducibility throughout the workflow.

| Component | Description |
| --- | --- |
| artifacts/ | Stores external files logged using mlflow.log_artifact() or log_model(). |
| metrics/ | Contains time-stamped metric logs stored as individual JSON files. |
| params/ | Contains parameters logged via mlflow.log_param() as key-value pairs. |
| tags/ | Contains metadata tags such as run name, source, and user. |
| meta.yaml | Stores run metadata, including run status, start/end times, and user info. |

Table 18.4: Brief description of each key component found within a run directory

The run name visible in the UI (e.g., traveling-ant-386) is stored as a tag and can be found within the tags/ directory.

  • Troubleshooting guidelines: To address common issues encountered during local MLflow tracking, consider the following:
    • Run not visible in the MLflow UI: Confirm that the corresponding run directory exists under mlruns/<experiment_id>/, and that its meta.yaml file is valid and properly formatted. Also, verify that the tracking URI (e.g., file:///.../mlruns) used during logging matches that of the UI server.
    • Missing metrics or parameters: Use commands such as ls mlruns/0/<run_id>/metrics/ to confirm whether metrics were logged and written to disk. Absence of these files usually indicates that mlflow.log_metric() was called outside of an active run, or that the run was not properly committed using mlflow.end_run().
    • Corrupted or incomplete runs: If the UI displays incomplete information or fails to load a run, inspect meta.yaml for missing or malformed fields. Inconsistencies here can prevent MLflow from parsing the run successfully.
    • UI displays 404 or No data: This may occur when the server cannot locate the run_id within the specified experiment directory. Double-check the consistency between your programmatic logging and server configuration paths.

      The local mlruns/ directory structure provides a transparent and accessible way to inspect and debug experiment tracking when using MLflow without a remote backend. Each experiment is mapped to a directory by its ID, and every run is stored in a uniquely named folder with standardized subdirectories for metrics, parameters, artifacts, and metadata. Understanding this structure is essential for developers building robust ML observability systems, especially in the context of experimental LLM-based pipelines, where reproducibility and traceability are foundational.
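
The checks in the guidelines above can be partially automated. The following sketch walks one experiment directory inside a local mlruns/ tree and flags runs that are missing meta.yaml or have logged no metrics. The directory names follow the layout described above; audit_mlruns itself is a hypothetical helper, not part of MLflow:

```python
from pathlib import Path

def audit_mlruns(experiment_dir: str) -> dict:
    """Return {run_id: [problems]} for every run under one experiment dir."""
    problems = {}
    for run_dir in Path(experiment_dir).iterdir():
        if not run_dir.is_dir():
            continue  # skip the experiment-level meta.yaml file
        issues = []
        if not (run_dir / "meta.yaml").is_file():
            issues.append("missing meta.yaml")
        metrics_dir = run_dir / "metrics"
        if not metrics_dir.is_dir() or not any(metrics_dir.iterdir()):
            issues.append("no metrics logged")
        if issues:
            problems[run_dir.name] = issues
    return problems
```

Running audit_mlruns("mlruns/0") on the default experiment would report any run folder lacking the files described in Table 18.4, which is usually the fastest way to explain a run that does not appear in the UI.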

Troubleshooting MLflow through the local filesystem structure offers direct visibility into how runs, artifacts, and metadata are organized. By inspecting run folders, logs, and parameter files, practitioners can quickly isolate issues, verify experiment integrity, and ensure reproducibility, making filesystem-level exploration a practical first step in MLflow debugging.

As we come to the end of this journey into multimodal GenAI, it is important to recognize that while multimodal systems represent the frontier, enabling models to reason jointly across text, images, audio, and beyond, they are built upon a strong foundation of traditional generative models. For a deeper dive into these fundamentals, refer to Learn Python Generative AI: Journey from Autoencoders to Transformers to Large Language Models, which explores the core architectures that paved the way for today’s multimodal breakthroughs.

Conclusion

As we draw the final lines of this chapter, and indeed, this book, we reflect on a journey that has traversed the complex terrain of operationalizing RAG systems and the broader landscape of GenAI. From our exploration of foundational run directory structures and troubleshooting in MLflow, to the nuanced requirements of observability, evaluation, and traceability in production-grade GenAI applications, each section has built toward a holistic understanding of robust AI system deployment. The comparison of DevOps, MLOps, and RAGOps illuminated the evolving paradigms for managing intelligent systems that intertwine software engineering and generative reasoning.

The hands-on examples, including MLflow instrumentation and graph-enhanced recommender systems, rooted theory in practice, emphasizing that the pillars of reproducibility, transparency, and accountability are vital for the future of AI-driven innovation. As we close, it is clear that these methodologies are not mere technical necessities but the foundation for ethical and sustainable AI development.

Index

A

溯因推理 269

Abductive Reasoning 269

智能体人工智能 92

Agentic AI 92

智能体人工智能/人工智能代理, 比较127、128

Agentic AI/AI Agents, comparing 127, 128

智能体人工智能,架构 92

Agentic AI, architecture 92

智能体人工智能,术语

Agentic AI, terms

代理 SDK 93

Agents SDK 93

助手 API 94

Assistants API 94

法典 94

Codex 94

93号操作员

Operator 93

响应 API 92

Response API 92

智能体 GenAI 106

Agentic GenAI 106

智能体GenAI,模式

Agentic GenAI, pattern

聚合器 109

Aggregator 109

评论家/验证者 115

Critic/Validator 115

数据库 113

Database 113

层级 111

Hierarchical 111

人机交互 111

Human-in-the-Loop 111

循环 108

Loop 108

记忆转换 113

Memory Transformation 113

多模态代理 116

Multimodal Agent 116

谈判者 116

Negotiator 116

110号网络

Network 110

平行线 106

Parallel 106

规划执行者 114

Planner-Executor 114

路由器 109

Router 109

顺序 107

Sequential 107

共享工具 112

Shared Tools 112

主管-下属 118

Supervisor-Subordinate 118

时间规划器 120

Temporal Planner 120

投票/共识 117

Voting/Consensus 117

监督/恢复 119

Watchdog/Recovery 119

主动型RAG/非主动型 RAG 30、31

Agentic RAG/Non-Agentic RAG 30, 31

类比推理 270

Analogical Reasoning 270

自回归 第九代

Autoregressive Generation 9

自回归生成策略

Autoregressive Generation, strategies

温度 10

Temperature 10

Top-k 抽样 10

Top-k Sampling 10

顶级抽样 10

Top-p Sampling 10

B

Bi-Encoders/Cross-Encoders 24
Bi-Encoders/Cross-Encoders, pattern
Bi-Encoders 24
Cross-Encoders 24

C

Causal Reasoning 271, 272
Cloud LLMs 189
Cloud LLMs, concepts
LLM-as-a-Judge 190
Rationale/Functionality 190
Code Implementation 205
Code Implementation, components
ChromaDB 206
Configuration Management 205
Data Loaders 206
Embedding Functions 205
Code Implementation, ensuring 206-208
ColBERT/ColPali 139
ColBERT/ColPali, capabilities 139
Commonsense Reasoning 270, 271
Continuous Monitoring 404
Continuous Monitoring, points
Anomaly Detection 405
Self-Healing Systems 406
Continuous Monitoring, sources
RAGOps 404
RAG Systems 404
Continuous Monitoring, techniques
Custom Evaluators 405
Drift Detection 405
Logging/Tracing 405
Observability Platforms 405
Cross-Encoder 198
Cross-Encoder, architecture 198
Cross-Encoder, embedding 201, 202
Cross-Encoder/Late Interaction, comparing 198, 199
Cross-Modal Interaction 160
Cross-Modal Interaction, functionalities 160-162
Cross-Modal Interaction, illustrating 169-171
Cross-Modal Interaction, terms
Data Directory 166
Frontend 163
Loaders 167
Retrieval System 166

D

Data Accessibility 321
Data Accessibility, terms
Curiosity 323
Data Governance/Traceability 323
Data Literacy 322
Democratizing Data 322
Global Access 323
Real Time Decision-Making 322
Technical Bridging 321
Data Ingestion Pipeline 409
Data Ingestion Pipeline, recommendations 409
Deductive Reasoning 268

E

Entity Extraction 317
Entity Extraction, implementing 318-321
Entity Extraction, workflow 318

F

Few-Shot Prompting 283
Few-Shot Prompting, benefits 283
Few-Shot Prompting, limitations 283

G

GenAI, advancements 3
GenAI Agent 29
GenAI Agent, ensuring 29, 30
GenAI, capabilities
Agentic AI 267
Ambiguity/Disambiguation 265
Deliberation 264
Human-AI Collaboration 267
Learning Generalizable 266
Multimodal Integration 265
Prompt Engineering/CoT Reasoning 266
Reranking/Meta-Reasoning 266
Trust/Explainability 265
GenAI, configuring 2-4
GenAI, integrating 373, 374
GenAI, models
Artificial Neural Networks (ANNs) 375
Classification 374
CNNs 375
Forecasting 374
OCR 376
Regression 374
Segmentation 375
Generation Framework 176
Generation Framework, architecture 178
Generation Framework, checklist 179
Generation Framework, ensuring 176, 177
Generation Framework, outlines
Document/Image Ingestion 178
Embedding Models 178
LLM 179
Output Delivery 179
User Query Interface 178
Vector Database 178
Vector Search 179
Generation System 7
Generation System, architecture 8
Generation System, techniques
Diffusion Models 9
Language Models 9
Vision Models 9
Generation System, types
Audio Generation 9
Image Generation 9
Text Generation 9
Generative AI (GenAI) 2
Generator Part 180
Generator Part, sources 180, 181
Genetic Algorithms (GA) 228
GPU, situations 64, 65
Grading Mechanisms 143
Grading Mechanisms, advantages 144
Grading Mechanisms, layers
Adaptive RAG 143
Agentic RAG 143
CRAG 143
Self-RAG 143
Grading Mechanisms, outlines
Answer Quality Grader 148
Hallucination Detection Grader 146
Retrieval Relevance Grader 145
Graphics Processing Units (GPUs) 64
Guardrails 25, 26
Guardrails, frameworks
Azure AI Prompt Shields 28
NVIDIA NeMo 28
OpenAI Moderation API 28
Guardrails, methods 27
Guardrails, types
Input 26
Output 26

H

HITL, configuring 122, 123
HITL, types
End-to-End 124
Multi-Agent 124
Human-In-The-Loop (HITL) 122

I

Inductive Reasoning 268, 269
Interaction 130
Interaction, types
Full 131
Late 132
No 131

L

LLM Evaluation 393
LLM Evaluation, methods 394
LLM Evaluation, stages 393
LLMs 182
LLMs, ensuring 376, 377
LLMs, purpose 378, 379
LLMs, sections
HistLLM 183
LLMRec 183
MMREC 183
Molar 183
Serendipitous MLLM 183
LLMs, use cases
Baseline Model Development 377
Data Characteristics 377
Stacked Ensemble Learning 378
Local GPU 65
Local GPU, capabilities
Deployment Patterns 67
Hardware Requirements 66
Model Files 67
Performance Tips 67
Software 67
Local GPU, configuring 66

M

Mathematical Reasoning 274, 275
Mistral 364
Mistral, integrating 364
MLflow 414
MLflow, demonstrating 420-422
MLflow, ensuring 414
MLLM, configuring 182
ML Model Integration 387
Model Context Protocols (MCP) 31, 32
Multi-Document Query 94
Multi-Document Query, initializing 94
Multi-Index Embedding 202
Multi-Index Embedding, configuring 202-204
Multimodal GenAI System 40
Multimodal GenAI System, categories
Image Systems 57
Image-to-Text 55
Text and Image 56
Text-to-Code 59
Text-to-Image 54
Text-to-SQL 58
Multimodal GenAI System, illustrating 50, 51
Multimodal GenAI System, steps
Embedding Generation 41
Knowledge Base 42
Response Generation 42
Result Returning 43
Retrieved Results Consolidation 42
User Query Submission 41
Vector Database Search 42
Multimodal LLM (MLLM) 181
Multimodal RAG System 229
Multimodal RAG System, architecture 236
Multimodal RAG System, flow
Adaptive Embedding 239
Context Assembly/Language Generation 238
Indexing Behavior 239
Two-Stage Retrieval 238
Vector Embedding Pipeline 237
Multimodal RAG System, illustrating 230
Multimodal RAG System, initializing 231-235
Multimodal Reasoning 277
Multimodal Retrieval 199
Multimodal Retrieval, strategies 199
Multimodal Retrieval System 156
Multimodal Retrieval System, applications
Content Discovery 160
Medical Imaging 159
Multimodal QA 159
Visual Product Search 159
Multimodal Retrieval System, architecture 156, 157
Multimodal Retrieval System, components
Document Chunking 158
Image Modalities 157
Query Encoding 158
Result Mapping/Response Generation 159
Technical Enhancement 159
User Interaction/Query Intake 157
Vector Store Integration 158
Multimodal Systems 154
Multimodal Systems, implementing 155
Multimodal Systems, sections
Image-to-Image 155
Image-to-Text 155
Text-to-Image 154
Text to Specs 155
Multimodal Vector Embedding 43
Multimodal Vector Embedding, architecture 43, 44
Multimodal Vector Embedding, queries
Multiple Collections 49
Single Collection 48
Multimodal Vector Embedding, solutions
Collections 45
Indexing 46, 47
Multimodal Vector Database 44
Payload 45
Point IDs 45
Storage/Vector Store 46
Vectors 45
Multi-Stage RAG 140
Multi-Stage RAG, benefits 141
Multi-Stage RAG, components
Hybrid Retrieval 140
Iterative Feedback 141
Multimodal Retrieval 140
Query Expansion/Refinement 140
Reranking Stage 140
Validation/Fact-Checking 141
Multi-Stage RAG, implementing 151
Multi-Stage RAG, outlines
Adaptive 142
Agentic 142
Branched 142
Corrective 142
Hypothetical Document Embedding (HyDE) 142
Self 142
Simple 141
Simple Memory 141
Multi-Stage RAG, stage
Generation 140
Retrieval 140
Multi-Vector Representation 133
Multi-Vector Representation, configuring 133, 134
Multi-Vector Representation, ensuring 134, 135

O

Observability 406
Observability, platforms
Arize Phoenix 407
Langfuse 406
MLflow 407
WhyLabs 407
Observability, tools
InspectorRAGet 408
OTEL Instrumentations 407
RAGViz 407
OCR 350
OCR, architecture 354
OCR, concepts
Mistral 364
Receipt Data 366
Regex Context 365
OCR, configuring 350, 351
OCR, illustrating 352
OCR, terms
Generate Intelligent 361
Image-Based Inputs 355
Shopping Assistance 355
Ollama 68, 69
Ollama, capabilities
AutoGPTQ 69
GPT4All 69
LM Studio 69
Text Generation Web UI 69
Unsloth 69
Ollama, implementing 69, 70
Ollama With PDF Document, preventing 71-73
OpenAI 88
OpenAI API 88
OpenAI API, categories 89
OpenAI API, functionalities 89
OpenAI API, use cases
Accessing Models 90
Major OpenAI Models 89
Right Model 91
OpenAI, breakdown
Generative Responsive 186
Grading/Generation Models 189
Import Statements 186
Retrieval Relevance 187
OpenAI, components
Generative Response 186
Retrieval Relevance 186
OpenAI, ensuring 184, 185
OpenAI, history 88
OpenAI, sections
Chain Assembly 102
Configuration 97
Conversational Memory 102
Dependencies 103
Document Load/Chunking 99
Hybrid Retriever 100
Initialization Embedding 97
Language Model 101
Main Controller 96
Metadata Tagging 99
Prompt Template 101
Vector Store 98
Orchestration 15
Orchestration, terms
Agentic Systems 16, 17
RAG Systems 15

P

Prompting 10, 282
Prompting, architecture 282
Prompting, scenarios 286
Prompting, sections 284-286
Prompting, techniques
Few-Shot 283
Zero-Shot 282

R

RAG, applications
Document QA Systems 15
Enterprise Chatbots 15
Knowledge Management 15
Personalized AI Assistants 15
RAG-Based Recommendation System 408
RAG, challenges 84-86
Groundedness 11
Hallucination 11
State Knowledge 11
RAG, components 73
RAG, concepts
Conversation Buffer Memory 81
Hybrid/Semantic Search 79
LangChain 76
Metadata, embeddings 77
Natural Language Generation 81
PDF Document 75
QA Chain 82
ReAct Prompt 81
User Chat Loop 83
RAG Evaluation 394
RAG Evaluation, cause
Distinction 395
GenAI Ops 395
Output Quality Ensuring 395
RAG Evaluation, layers
Generation/Groundedness 394
Pipeline-Level Metrics 395
Retrieval Quality 394
RAG/LLM Evaluation, terms
Feedback Loops 396
Hallucinations Drift, monitoring 395
Retrieval Quality, evaluating 396
Version Control/Traceability 396
RAGOps 397
RAGOps, scenarios
During Development 397
Post-Development 400
RAG Pipeline, libraries
LlamaIndex 407
Ragas 407
RAG Pipeline, outlines
Context Preparation 12
Generation 12
Output Delivery 12
Query Understanding 12
Retrieval 12
RAG, steps
Generation 12
Retrieval 11
RAG, techniques
Memory-Augmented 14
Multimodal 14
Reranking 14
RAG, terms
Iterative 13
Prompt Engineering 14
Vector Databases 13
RAG, types
Single-Stage 12
Two-Stage 13
Real-Time Retail Intelligence 330
Real-Time Retail Intelligence, outlines
Delayed Decisions 330
Query Latency 331
Revenue Impact 331
Siloed Data 330
Reasoning 268
Reasoning, benchmark 278
Reasoning, types
Abductive 269
Analogical 270
Causal 271
Commonsense 270
Deductive 268
Inductive 268
Mathematical 274
Multimodal 277
Spatial 272
Temporal 273
Tool-Based 275
Receipt Data 366
Receipt Data, demonstrating 367-369
Recommendation Stage 293
Recommendation Stage, outlines 294
Recommendation Stage, steps 296
Recommendation Stage, workflow 297-299
Regex Context 365
Regex Context, illustrating 366
Reranking 23, 24, 194, 195
Reranking, architecture 195, 196
Reranking, categories
Cross-Encoder 196
Hybrid 197
Late Interaction 196
Learning-to-Rank 197
LLM-Based 197
Reranking, illustrating 286, 287
Reranking, module
embedding_utils.py 288
index_builder.py 289
langgraph_agent.py 290
loaders.py 287
reranker.py 289
Retrieval-Augmented Generation (RAG) 11
Retrieval Pipeline, phase 409
Retrieval Pipeline, terms
Agentic Control Loop 410
Agentic RAG Design 410
Retrieval System 4, 5
Retrieval System, architecture 214
Retrieval System, challenges
Adaptive Index 227
Contextual Filtering 226
Embedding Normalization 225
Genetic Algorithms 227
Modality-Based Routing 224
Query Expansion 224
Score Fusion 226
Weighted Embedding Fusion 225
Retrieval System, evolutions
Hybrid Retrieval 7
Learning-to-Retrieve (LTR) 7
Memory-Augmented 7
Multi-Vector Representation 7
Retriever-Generator Fusion (RAG) 7
Retrieval System, limitations
Contextual Awareness 216
Index Staleness 216
Limited Semantic 215
Modality Mismatch 215
Precision Trade-Offs 215
Ranking Inefficiencies 216
Retrieval System, techniques
Adaptive Index 223
Embedding Normalization 219
Hybrid Retrieval 220
Modality-Based Routing 218
Multi-Index Embedding 218
Query Expansion 219
Reranking 221
Retrieval System, types
Dense 5
Sparse 5

S

Software Development/Ops, comparing 412, 413
Spatial Reasoning 272, 273
Streamlit-Based Frontend 246
Streamlit-Based Frontend, implementing 247, 248
STT/TTS 243
STT/TTS, integrating 244, 245

T

Temporal Reasoning 273, 274
Text-to-SQL 302
Text-to-SQL, challenges
Ambiguity 307
Data Privacy/Governance 310
Domain Generalization 308
Feedback Loops 310
Multi-Turn Interaction 309
Query Execution 309
Schema Alignment/Linking 308
SQL Syntax 309
User Intent Disambiguation 310
Text-to-SQL, configuring 302-305
Text-to-SQL, domains
BI/Analytics 305
Conversational Interfaces 306
Financial Services/Risk Monitoring 307
Healthcare/Clinical Informatics 306
Human Resources 307
IoT Operations 307
Operations Analytics 306
Retail/E-commerce Personalization 306
SQL Learning 306
Text-to-SQL, illustrating 311-313
Text-to-SQL Pipeline 331
Text-to-SQL Pipeline, architecture 334, 335
Text-to-SQL Pipeline, concepts
Agent Modules 336
Frontend Interface 338
Index Initialization 338
Infrastructure Layer 337
Main Execution Layer 336
Task-Oriented 337
Text-to-SQL Pipeline, embedding 341, 342
Text-to-SQL Pipeline, entity
Generate SQL Query 343
SQL Query Grade 344
Summary Grade 345
Text-to-SQL Pipeline, instructions 335
Text-to-SQL Pipeline, integrating 339, 340
Text-to-SQL Pipeline, steps 332, 333
Text-to-SQL, practices 327, 328
Text-to-SQL, sections
Component-Level 325
Exact Match Accuracy 324
Execution Accuracy 324
Human Evaluation 326
Query Execution Success 325
Semantic Equivalence 326
Throughput Metrics 326
Tokenization 17
Tokenization, ensuring 18
Tokenization, types
Byte-Level 18
Character-Level 18
Subword-Level 18
Word-Level 17
Token Utilization 144
Token Utilization, terms
Input Contexts 144
Intermediate Summarization 144
Long-Form Generation 144
Tool-Based Reasoning 275, 276
Two-Stage RAG 138
Two-Stage RAG, architecture 138
Two-Stage RAG, reasons
Ensemble Robustness 139
Generation Alignment 139
Ranking Fidelity 139
Two-Stage RAG Systems 135
Two-Stage RAG, terms
One Dense Retrievals 138
Semantic Precision 138

V

Vector Database 19
Vector Database, architecture 20
Vector Database, ensuring 23
Vector Database, operations
Embedding Models 22
Indexing Algorithms 20
Search Algorithms 21
Vector Database, types 19, 20
Vision-Language Models (VLMs) 34
VLMs, architecture 50
VLMs, cases 52
VLMs, challenges
Data Requirements 39
Efficiency/Latency 40
Generalization Across Domains 40
Lack Integration 40
Limited Multimodal Reasoning 39
Modality Imbalance 39
VLMs, ensuring 36-38
VLMs, types
Generative 35
Instruction-Tuned 36
Multimodal Reasoning 36
Retrieval-Focused 35
VQA/Captioning 35
Voice-Enabled Pipeline 248, 249
Voice-Enabled Pipeline, interacting 249-253
Voice-Enabled RAG 245
Voice-Enabled RAG, concerns
Streamlit-Based Frontend 246
Tech Stack 246
Voice-Enabled Pipeline 248

X

XGBoost Pipeline 379
XGBoost Pipeline, illustrating 380, 381
XGBoost Pipeline, terms
Agent Orchestration 386
FastAPI Inference 385
FastAPI Serving Layer 385
LangChain Code 383
ML Backend 383, 384

Z

Zero-Shot Prompting 282
Zero-Shot Prompting, benefits 283
Zero-Shot Prompting, limitations 283